# A niftier framework for the statistical ensembles

The way we have obtained the canonical and grand canonical partition functions from the microcanonical ensemble in The canonical ensemble and The grand canonical ensemble is a rather "classic" one, and maybe also the most intuitive.

However this is not the only possible one: in fact, as we now want to show it is possible to obtain all the ensembles (including the microcanonical one) from the principle of maximum entropy, where the entropy is defined in the most general way, i.e. :

${\displaystyle S=-k\operatorname {Tr} (\rho \ln \rho )}$

In other words what we want to see is that maximizing the entropy of a system as written now with appropriately chosen constraints it is possible to determine both the canonical and grand canonical ensembles.

Let us consider a very different but simple and intuitive example to understand how this can be possible. Suppose we have a normal six-sided die; if we know nothing about it (namely we don't know if it has been fixed or not) then all the possible rolls have the same probability, i.e. ${\displaystyle p_{i}=1/6}$ for ${\displaystyle i=1,\dots ,6}$. This fact can also be obtained from the maximization of Shannon's entropy (we remove any proportionality constant for simplicity):

${\displaystyle S=-\sum _{i=1}^{6}p_{i}\ln p_{i}}$
with the constraint ${\textstyle \sum _{i}p_{i}=1}$. In fact (as it must be done for constrained optimization problems like this one) the maximization of ${\displaystyle S}$ leads to:
{\displaystyle {\begin{aligned}{\frac {\partial }{\partial p_{j}}}\left[-\sum _{i}p_{i}\ln p_{i}-\lambda \left(\sum _{i}p_{i}-1\right)\right]=-1-\ln p_{j}-\lambda =0\quad \Rightarrow \\\Rightarrow \quad p_{j}=e^{-\lambda -1}:=c\end{aligned}}}
where we have simply relabelled the constant in the last step (note that ${\displaystyle p_{j}}$doesn't depend on ${\displaystyle j}$); therefore from ${\textstyle \sum _{i}p_{i}=1}$ we have exactly ${\displaystyle p_{i}=1/6}$.

Now, suppose that the die has been fixed and that ${\displaystyle p_{1}=2p_{6}}$; in order to find the new probabilities ${\displaystyle p_{i}}$ we now have to maximize ${\displaystyle S}$ with the additional constraint ${\displaystyle p_{1}=2p_{6}}$. Therefore:

{\displaystyle {\begin{aligned}{\frac {\partial }{\partial j}}\left[-\sum _{i}p_{i}\ln p_{i}-\lambda \left(\sum _{i}p_{i}-1\right)-\mu \left(p_{1}-2p_{6}\right)\right]=0\qquad \Rightarrow \\\Rightarrow \qquad {\begin{cases}-1-\ln p_{j}-\lambda =0&j\neq 1,6\\-1-\ln p_{1}-\lambda -\mu =0&j=1\\-1-\ln p_{6}-\lambda +2\mu =0&j=6\end{cases}}\qquad \Rightarrow \\\Rightarrow \qquad {\begin{cases}p_{i}=e^{-\lambda -1}\equiv 1/Z&i\neq 1,6\\p_{1}=e^{-\lambda -1-\mu }\\p_{6}=e^{-\lambda -1+2\mu }\end{cases}}\qquad \Rightarrow \\\qquad \Rightarrow \qquad {\begin{cases}p_{i}=1/Z&i\neq 1,6\\p_{1}=e^{-\mu }/Z\\p_{6}=e^{2\mu }/Z\end{cases}}\end{aligned}}}
and requiring that ${\displaystyle p_{1}=2p_{6}}$ and ${\textstyle \sum _{i}p_{i}=1}$ we have ${\displaystyle \mu =-\ln 2/3}$ and ${\displaystyle Z=4+e^{-\mu }+e^{2\mu }}$. Explicitly:
${\displaystyle p_{i}=0.170\quad i\neq 1,6\quad \qquad p_{1}=0.214\quad \qquad p_{6}=0.107}$
So we see that we indeed managed to reconstruct all the probability distribution of the system only from the maximization of its entropy, with the appropriate constraints.

Let us now see this more in general: suppose we have a system which can be found in ${\displaystyle \Omega }$ different states (for simplicity we now consider the discrete case), each with probability ${\displaystyle p_{i}}$. Let us also suppose that we have put some contraints on the mean values of some observables ${\displaystyle O^{(i)}}$ defined on this system, i.e.:

{\displaystyle {\begin{aligned}\left\langle O^{(0)}\right\rangle =\sum _{i}p_{i}=1={\overline {O}}^{(0)}\\\left\langle O^{(1)}\right\rangle =\sum _{i}p_{i}O_{i}^{(1)}={\overline {O}}^{(1)}\quad \cdots \quad \left\langle O^{(\alpha )}\right\rangle =\sum _{i}p_{i}O_{i}^{(\alpha )}={\overline {O}}^{(\alpha )}\end{aligned}}}
where ${\displaystyle O_{j}^{(\delta )}}$ are some functions depending on ${\displaystyle j}$ (considering the previous example of the die, with our notation we have ${\displaystyle O_{i}{}^{(1)}=2\delta _{i1}-\delta _{i6}}$) and ${\displaystyle {\overline {O}}{}^{(\delta )}}$ are some given values of the observables. We have put also the normalization condition in the same form as the other constraints (with ${\displaystyle O_{i}{}^{(0)}=1}$) in order to have a more general notation. As we know the entropy of the system will be given by (we again drop any constant in front of ${\displaystyle S}$):
${\displaystyle S=-\sum _{i}p_{i}\ln p_{i}}$
Let us therefore see what happens if we maximize ${\displaystyle S}$ with the constraints . What we have to find is:
${\displaystyle \max _{{\vec {p}}({\vec {\lambda }})}\left[S-\lambda _{0}\left(\left\langle O^{(0)}\right\rangle -{\overline {O}}^{(0)}\right)-\lambda _{1}\left(\left\langle O^{(1)}\right\rangle -{\overline {O}}^{(1)}\right)-\cdots -\lambda _{\alpha }\left(\left\langle O^{(\alpha )}\right\rangle -{\overline {O}}^{(\alpha )}\right)\right]}$
where ${\displaystyle {\vec {p}}({\vec {\lambda }})}$ is a short notation to indicate the set of the probabilities ${\displaystyle p_{i}}$ seen as functions of ${\displaystyle \lambda _{j}}$. Therefore:
{\displaystyle {\begin{aligned}{\frac {\partial }{\partial p_{j}}}\left[S-\lambda _{0}\left(\left\langle O^{(0)}\right\rangle -{\overline {O}}^{(0)}\right)-\lambda _{1}\left(\left\langle O^{(1)}\right\rangle -{\overline {O}}^{(1)}\right)-\cdots \right.\\\left.\cdots -\lambda _{\alpha }\left(\left\langle O^{(\alpha )}\right\rangle -{\overline {O}}^{(\alpha )}\right)\right]=\\=-1-\ln p_{j}-\lambda _{0}O_{j}^{(0)}-\lambda _{1}O_{j}^{(1)}-\cdots -\lambda _{\alpha }O_{j}^{(\alpha )}=0\end{aligned}}}
${\displaystyle \Rightarrow \quad p_{j}=e^{-1-\lambda _{0}-\lambda _{1}O_{j}^{(1)}-\cdots -\lambda _{\alpha }O_{j}^{(\alpha )}}}$
From the normalization condition we have:
${\displaystyle 1=\sum _{j}p_{j}=e^{-1-\lambda _{0}}\sum _{j}e^{-\lambda _{1}O_{j}^{(1)}-\cdots -\lambda _{\alpha }O_{j}^{(\alpha )}}}$
If we define:
${\displaystyle Z:=\sum _{j}e^{-\lambda _{1}O_{j}^{(1)}-\cdots -\lambda _{\alpha }O_{j}^{(\alpha )}}}$
then:
${\displaystyle Z=e^{1+\lambda _{0}}\quad \Rightarrow \quad p_{j}={\frac {1}{Z}}e^{-\sum _{\delta }\lambda _{\delta }O_{j}^{(\delta )}}}$
which has a very familiar form (the one of the canonical and grand canonical probability densities).

Now, in order to solve the problem we still have to impose all the other constraints: ${\textstyle \left\langle O^{(\gamma )}\right\rangle ={\overline {O}}{}^{(\gamma )}}$. These can be written as:

${\displaystyle \left\langle O^{(\gamma )}\right\rangle =\sum _{j}p_{j}O_{j}^{(\gamma )}={\frac {1}{Z}}\sum _{j}e^{-\sum _{\delta }\lambda _{\delta }O_{j}^{(\delta )}}O_{j}^{(\gamma )}={\frac {\sum _{j}e^{-\sum _{\delta }\lambda _{\delta }O_{j}^{(\delta )}}O_{j}^{(\gamma )}}{\sum _{j}e^{-\sum _{\delta }\lambda _{\delta }O_{j}^{(\delta )}}}}}$
From the definition of ${\displaystyle Z}$ we see that:
${\displaystyle \left\langle O^{(\gamma )}\right\rangle =-{\frac {1}{Z}}{\frac {\partial Z}{\partial \lambda _{\gamma }}}=-{\frac {\partial \ln Z}{\partial \lambda _{\gamma }}}}$
which has exactly the same form as the equation that defines the mean value of the energy in the canonical ensemble, for example. Therefore the values ${\displaystyle {\overline {\lambda }}_{\delta }}$ of the parameters ${\displaystyle \lambda _{\delta }}$which maximize ${\displaystyle S}$ with the constraints are the solutions of the equations:
${\displaystyle {\overline {O}}^{(1)}=-{\frac {\partial \ln Z}{\partial \lambda _{1}}}_{|{\overline {\lambda }}_{1}}\qquad \cdots \qquad {\overline {O}}^{(\alpha )}=-{\frac {\partial \ln Z}{\partial \lambda _{\alpha }}}_{|{\overline {\lambda }}_{\alpha }}}$
These equations are in general very difficult to solve analytically but there is a rather simple method, which we now briefly see, that allows one to determine the parameters ${\displaystyle {\overline {\lambda }}_{\delta }}$ numerically. Let us begin noting that:
{\displaystyle {\begin{aligned}{\frac {\partial ^{2}\ln Z}{\partial \lambda _{\alpha }\partial \lambda _{\gamma }}}={\frac {\partial }{\partial \lambda _{\alpha }}}\left({\frac {1}{Z}}{\frac {\partial Z}{\partial \lambda _{\gamma }}}\right)={\frac {\partial }{\partial \lambda _{\alpha }}}\left\langle O^{(\gamma )}\right\rangle ={\frac {\partial }{\partial \lambda _{\alpha }}}\left({\frac {\sum _{j}e^{-\sum _{\delta }\lambda _{\delta }O_{j}^{(\delta )}}O_{j}^{(\gamma )}}{\sum _{j}e^{-\sum _{\delta }\lambda _{\delta }O_{j}^{(\delta )}}}}\right)=\\={\frac {\sum _{j}e^{-\sum _{\delta }\lambda _{\delta }O_{j}^{(\delta )}}O_{j}^{(\gamma )}O_{j}^{(\alpha )}}{\sum _{j}e^{-\sum _{\delta }\lambda _{\delta }O_{j}^{(\delta )}}}}-{\frac {\sum _{j}e^{-\sum _{\delta }\lambda _{\delta }O_{j}^{(\delta )}}O_{j}^{(\alpha )}}{Z^{2}}}\sum _{i}e^{-\sum _{\delta }\lambda _{\delta }O_{i}^{(\delta )}}O_{i}^{(\gamma )}=\\=\left\langle O^{(\alpha )}O^{(\gamma )}\right\rangle -\left\langle O^{(\alpha )}\right\rangle \left\langle O^{(\gamma )}\right\rangle =\left\langle \left(O^{(\alpha )}-\left\langle O^{(\alpha )}\right\rangle \right)\left(O^{(\gamma )}-\left\langle O^{(\gamma )}\right\rangle \right)\right\rangle \end{aligned}}}
where the last term is the mean value of the product of two fluctuations: this is the covariance of ${\displaystyle O{}^{(\alpha )}}$ and ${\displaystyle O{}^{(\gamma )}}$. In general, the ${\displaystyle (\alpha ,\gamma )}$-th element of the covariance matrix ${\displaystyle C}$is exactly defined as the covariance between ${\displaystyle O{}^{(\alpha )}}$ and ${\displaystyle O{}^{(\gamma )}}$:
${\displaystyle C^{\alpha \gamma }:=\left\langle \left(O^{(\alpha )}-\left\langle O^{(\alpha )}\right\rangle \right)\left(O^{(\gamma )}-\left\langle O^{(\gamma )}\right\rangle \right)\right\rangle }$
Therefore, we have that:
${\displaystyle {\frac {\partial ^{2}\ln Z}{\partial \lambda _{\alpha }\partial \lambda _{\gamma }}}=C^{\alpha \gamma }}$
The covariance matrix is positive (semi)definite; in fact if ${\displaystyle {\vec {x}}}$ is a generic vector, then:
${\displaystyle {\vec {x}}\cdot C{\vec {x}}=\sum _{\alpha ,\gamma }C^{\alpha \gamma }x_{\alpha }x_{\gamma }=\left\langle \left[\sum _{\delta }x_{\delta }\left(O^{(\delta )}-\left\langle O^{(\delta )}\right\rangle \right)\right]^{2}\right\rangle \geq 0}$
We needed these observations because if we now define the function:
${\displaystyle F=\ln Z+\sum _{\delta =1}^{\alpha }\lambda _{\delta }{\overline {O}}^{(\delta )}}$
then since ${\displaystyle -\partial \ln Z/{\partial \lambda _{\delta }}_{|{\overline {\lambda }}_{\delta }}={\overline {O}}{}^{(\delta )}}$ we have that:
${\displaystyle {\frac {\partial F}{\partial \lambda _{\delta }}}_{|{\overline {\lambda }}_{\delta }}=0}$
i.e. ${\displaystyle F}$ has an extremum in ${\displaystyle {\overline {\lambda }}_{\delta }}$. However:
${\displaystyle {\frac {\partial ^{2}F}{\partial \lambda _{\alpha }\partial \lambda _{\gamma }}}_{|{\overline {\lambda }}_{\alpha },{\overline {\lambda }}_{\gamma }}=C^{\alpha \gamma }\geq 0}$
and so this extremum is a minimum: ${\displaystyle F}$ is minimized by the values ${\displaystyle {\overline {\lambda }}_{\delta }}$ of the parameters ${\displaystyle \lambda _{\delta }}$ which maximize the entropy ${\displaystyle S}$ of the system. Therefore, in this way we can simply determine the values ${\displaystyle {\overline {\lambda }}_{\delta }}$ by findind the minima of ${\displaystyle F}$, which is a rather straightforward computational problem.

Let us now briefly see what happens in the continuous case, so that we can use what we have seen in the framework of ensemble theory. Since we are now dealing with continuous probability densities ${\displaystyle \rho }$, they will not depend on the discrete index ${\displaystyle i}$ but on the "continuous index" ${\displaystyle (\mathbb {Q} ,\mathbb {P} )}$, and of course the summations over ${\displaystyle i}$ must be replaced with integrations in phase space. In other words, the entropy of the system will be:

${\displaystyle S=-\int d\Gamma \rho (\mathbb {Q} ,\mathbb {P} )\ln \rho (\mathbb {Q} ,\mathbb {P} )}$
and the contraints are:

The probability density will be of the form:

${\displaystyle \rho (\mathbb {Q} ,\mathbb {P} )={\frac {1}{Z}}e^{-\sum _{\delta }\lambda _{\delta }O^{(\delta )}(\mathbb {Q} ,\mathbb {P} )}\quad \qquad Z=\int e^{-\sum _{\delta }\lambda _{\delta }O^{(\delta )}(\mathbb {Q} ,\mathbb {P} )}d\Gamma }$
and the values ${\displaystyle {\overline {\lambda }}_{\delta }}$ of the parameters ${\displaystyle \lambda _{\delta }}$ which maximize ${\displaystyle S}$ will be again the solutions of the equations:
${\displaystyle {\overline {O}}^{(1)}=-{\frac {\partial \ln Z}{\partial \lambda _{1}}}_{|{\overline {\lambda }}_{1}}\qquad \cdots \qquad {\overline {O}}^{(\alpha )}=-{\frac {\partial \ln Z}{\partial \lambda _{\alpha }}}_{|{\overline {\lambda }}_{\alpha }}}$

Let us now apply all this to the ensemble theory. In the microcanonical ensemble we only have the normalization constraint:

${\displaystyle \int \rho (\mathbb {Q} ,\mathbb {P} )d\Gamma =1}$
where the integration is done over the ${\displaystyle (\mathbb {Q} ,\mathbb {P} )}$ points that satisfy ${\displaystyle {\mathcal {H}}(\mathbb {Q} ,\mathbb {P} )=E}$, with ${\displaystyle {\mathcal {H}}}$ the Hamiltonian of the system and ${\displaystyle E}$ a given value of the energy. In this case, therefore, the only non-null "observable" is ${\displaystyle O^{(0)}}$, which as we have seen is a "fictious" one (defined so that also the normalization condition can be put in the form of a constraint on the mean value of a given observable). In other words, referring to our notation we have ${\displaystyle \alpha =0}$ and the probability density has indeed the form:
${\displaystyle \rho (\mathbb {Q} ,\mathbb {P} )={\frac {1}{\Omega (E)}}={\text{const.}}}$
where we have called ${\displaystyle \Omega (E)}$ the normalization factor. The value of ${\displaystyle Z}$ can be obtained intuitively as we have done in The microcanonical ensemble, i.e. since ${\displaystyle \rho }$ must be zero everything but on the phase space hypersurface of constant energy ${\displaystyle E}$, whose volume is ${\displaystyle \Omega (E)}$, then:
${\displaystyle \rho (\mathbb {Q} ,\mathbb {P} )={\frac {\delta ({\mathcal {H}}(\mathbb {Q} ,\mathbb {P} )-E)}{\Omega (E)}}}$
In the canonical ensemble we have a new constraint, i.e. we require the mean value of the energy to be fixed:
${\displaystyle \left\langle {\mathcal {H}}\right\rangle =\int \rho (\mathbb {Q} ,\mathbb {P} ){\mathcal {H}}(\mathbb {Q} ,\mathbb {P} )d\Gamma ={\overline {E}}}$
With our previous notation we have ${\displaystyle \alpha =1}$ and ${\displaystyle O^{(1)}(\mathbb {Q} ,\mathbb {P} )={\mathcal {H}}(\mathbb {Q} ,\mathbb {P} )}$, so that:
${\displaystyle \rho (\mathbb {Q} ,\mathbb {P} )={\frac {1}{Z}}e^{-\lambda _{1}{\mathcal {H}}(\mathbb {Q} ,\mathbb {P} )}\quad \qquad Z=\int e^{-\lambda _{1}{\mathcal {H}}(\mathbb {Q} ,\mathbb {P} )}d\Gamma }$
where[1]${\displaystyle \lambda _{1}=\beta }$.

In the grand canonical ensemble, then, we have the additional constraint of having the mean value of the number of particles fixed, namely ${\displaystyle O^{(2)}=N}$. Explicitly we have that the entropy of the system is:

${\displaystyle S=-\sum _{N}\int d\Gamma _{N}\rho _{N}(\mathbb {Q} ,\mathbb {P} )\ln \rho _{N}(\mathbb {Q} ,\mathbb {P} )}$
and the constraints are:
${\displaystyle \sum _{N}\int \rho _{N}(\mathbb {Q} ,\mathbb {P} )d\Gamma _{N}=1\quad \qquad \sum _{N}\int \rho _{N}(\mathbb {Q} ,\mathbb {P} ){\mathcal {H}}_{N}(\mathbb {Q} ,\mathbb {P} )d\Gamma _{N}={\overline {E}}}$
${\displaystyle \sum _{N}\int \rho _{N}(\mathbb {Q} ,\mathbb {P} )Nd\Gamma _{N}={\overline {N}}}$
In this case, we will have:
${\displaystyle \rho _{N}(\mathbb {Q} ,\mathbb {P} )={\frac {1}{\mathcal {Z}}}e^{-\lambda _{1}{\mathcal {H}}_{N}(\mathbb {Q} ,\mathbb {P} )-\lambda _{2}N}\quad \qquad {\mathcal {Z}}=\sum _{N}\int e^{-\lambda _{1}{\mathcal {H}}_{N}(\mathbb {Q} ,\mathbb {P} )-\lambda _{2}N}d\Gamma _{N}}$
where ${\displaystyle \lambda _{1}=\beta }$ and ${\displaystyle \lambda _{2}=-\mu \beta }$.

We conclude with an observation. We have determined the properties of the ensembles fixing the values of the first moments of the observables (i.e., ${\displaystyle {\mathcal {H}}}$ and ${\displaystyle N}$); we can ask: why haven't we fixed also other moments (namely ${\displaystyle {\mathcal {H}}^{2}}$, ${\displaystyle N^{2}}$ etc.)? In general it can happen that those additional moments are redundant; let us see a simple example in order to understand that. Suppose ${\displaystyle x}$ is a stochastic variable distributed along a probability distribution ${\displaystyle \varrho (x)}$; imagine that we are given ${\displaystyle \Omega }$ values ${\displaystyle x_{i}}$ of ${\displaystyle x}$ without knowing ${\displaystyle \varrho (x)}$ and that we want to understand what ${\displaystyle \varrho (x)}$ is from the ${\displaystyle x_{i}}$-s. How can we proceed? We could try to guess ${\displaystyle \varrho }$ with a procedure similar to what we have seen now. For example we could compute the ${\displaystyle n}$-th moments of ${\displaystyle x}$ with ${\displaystyle n=1,2,3}$, namely ${\displaystyle \left\langle x\right\rangle }$, ${\displaystyle \left\langle x^{2}\right\rangle }$ and ${\displaystyle \left\langle x^{3}\right\rangle }$. Then, our guess for ${\displaystyle \varrho }$ would be:

${\displaystyle \varrho (x)={\frac {1}{Z}}e^{-\lambda _{1}x-\lambda _{2}x^{2}-\lambda _{3}x^{3}}\quad \qquad Z=\int e^{-\lambda _{1}x-\lambda _{2}x^{2}-\lambda _{3}x^{3}}dx}$
and the values ${\displaystyle {\overline {\lambda }}_{i}}$ of ${\displaystyle \lambda _{i}}$ which give the correct expression of ${\displaystyle \varrho }$ are given by the solutions of:
${\displaystyle {\frac {\partial \ln Z}{\partial \lambda _{n}}}_{|{\overline {\lambda }}_{n}}={\overline {O}}^{(n)}}$
where ${\displaystyle {\overline {O}}{}^{(n)}}$ is computed from the given set of ${\displaystyle x_{i}}$:
${\displaystyle {\overline {O}}^{(n)}={\frac {1}{\Omega }}\sum _{i}x_{i}^{n}}$
If we determine ${\displaystyle {\overline {\lambda }}_{i}}$ with and increasing number ${\displaystyle \Omega }$ of data, we expect (or at least hope) that the parameters ${\displaystyle \lambda _{i}}$ will tend to some definite values; what happens is that they often tend to zero, when ${\displaystyle i\geq 2}$. For example, if in reality ${\displaystyle \varrho (x)=e^{-x/x_{0}}/x_{0}}$ for ${\displaystyle x\geq 0}$ then repeating the computations with higher values of ${\displaystyle \Omega }$ we would find ${\displaystyle \lambda _{1}\to 1/x_{0}}$ and ${\displaystyle \lambda _{2},\lambda _{3}\to 0}$: the second and third moments of ${\displaystyle x}$ are useless if we want to determine ${\displaystyle \varrho (x)}$. Let us note, however, that the use of the moments of ${\displaystyle x}$ can be useless in some cases: if in fact ${\displaystyle \varrho (x)=1/x^{2}}$ for ${\displaystyle x\geq 1}$ then we will never be able to express it as a product of exponentials, so the parameters ${\displaystyle \lambda _{i}}$ will not tend to definite values. What can we do in this case? We can use the moments of ${\displaystyle \ln x}$ instead of ${\displaystyle x}$; in fact if we compute for example ${\textstyle \left\langle O^{(1)}\right\rangle ={\overline {O}}{}^{(1)}}$, then our guess for ${\displaystyle \varrho (x)}$ will be:
${\displaystyle \varrho (x)={\frac {1}{Z}}e^{-\lambda _{1}\ln x}={\frac {1}{Z}}\cdot {\frac {1}{x^{\lambda _{1}}}}}$
and so in this case we would expect ${\displaystyle \lambda _{1}\to 2}$; we also see that if we included come higher moments of ${\displaystyle \ln x}$, their relative parameters would have all gone to zero, and so all the ${\displaystyle n}$-th moments with ${\displaystyle n\geq 2}$ are redundant. Therefore, we see that depending on the kind probability distribution we use different "recipes"[2] in order to determine ${\displaystyle \varrho (x)}$.

However, in ensemble theory something slightly different happens. Suppose in fact that we have fixed the first two moments of ${\displaystyle {\mathcal {H}}}$ in the canonical ensemble; then we would have:

${\displaystyle \rho (\mathbb {Q} ,\mathbb {P} )={\frac {1}{Z}}e^{-\lambda _{1}{\mathcal {H}}-\lambda _{2}{\mathcal {H}}^{2}}}$
Now the energy is extensive, so ${\displaystyle {\mathcal {H}}\propto N}$ and thus the leading term in the exponential should be ${\displaystyle {\mathcal {H}}^{2}}$; in general if we fixed an arbitrary number of moments of ${\displaystyle {\mathcal {H}}}$, the leading one is the last. This, however, doesn't really make sense from a physical point of view since it would imply that the only significant contribution is that of the ${\displaystyle n}$-th moment, with ${\displaystyle n\to \infty }$.

Therefore, in the case of statistical mechanics we are (although implicitly) assuming that the moments different from the first are actually significant, since strictly speaking there is nothing that would prevent us from fixing also their values. It is the incredible accuracy of the predictions made by statistical mechanics with the experimental results that confirms that this is a reasonable assumption.
1. At this point there is no way to understand that, and we are about to see something similar also in the grand canonical ensemble. This is the disadvantage of this way of deducing the statistical ensembles: it is elegant and mathematically consistent, but not really physically intuitive. The "classical" way we have used to derive the ensembles is surely less formal and rigorous, but it allows to understand physically what happens.
2. It could be asked then what can we do if we don't know absolutely nothing about ${\displaystyle \varrho }$. In this case there is nothing that can help, besides experience; in such cases, in fact, one tries to get some insights on the problems and then try different "recipes", from which something new can be learned about ${\displaystyle \varrho (x)}$.