# Shannon's entropy

From what we have just seen we can introduce a more general definition of entropy for a generic system that can be found in ${\displaystyle \Omega }$ different discrete states[1], each with probability ${\displaystyle p_{i}}$ (and ${\displaystyle i=1,\dots ,\Omega }$). This is known as Shannon's entropy, which is defined as:

${\displaystyle S_{S}=-k_{S}\left\langle \ln p_{i}\right\rangle =-k_{S}\sum _{i=1}^{\Omega }p_{i}\ln p_{i}}$
The constant ${\displaystyle k_{S}}$ is used instead of Boltzmann's constant because in general the connection to temperature can be irrelevant; it is defined as ${\displaystyle k_{S}=1/\ln 2}$, so that Shannon's entropy can also be rewritten as:
${\displaystyle S_{S}=-\sum _{i=1}^{\Omega }p_{i}\log _{2}p_{i}}$
The unit of measure of this entropy is the bit. Shannon's entropy is particularly useful because it is the only one that satisfies three important properties, which we now show.
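As a quick numerical illustration of this definition (a minimal sketch; the distributions below are arbitrary examples, not taken from the text), the following Python snippet computes ${\displaystyle S_{S}}$ in bits for a few simple cases:

```python
import math

def shannon_entropy_bits(p):
    """S_S = -sum_i p_i log2(p_i), in bits; zero-probability states contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(shannon_entropy_bits([0.5, 0.5]))    # fair coin: 1.0 bit
print(shannon_entropy_bits([0.9, 0.1]))    # biased coin: ~0.47 bits
print(shannon_entropy_bits([0.25] * 4))    # uniform over 4 states: 2.0 bits
```

Note that the uniform distribution gives the largest value among distributions with the same number of states, anticipating one of the properties discussed below.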

Theorem

Let ${\displaystyle p_{1},\dots ,p_{\Omega }}$ be a set of probabilities (therefore such that ${\displaystyle p_{i}\geq 0}$ and ${\displaystyle \sum _{i}p_{i}=1}$). Then Shannon's entropy ${\displaystyle S}$ (which is a function of these variables) as defined above satisfies the following properties:

• (Property zero) It is a continuous function

• (Property 1) ${\displaystyle S(p_{1},\dots ,p_{\Omega })}$ is maximized when all the ${\displaystyle p_{i}}$-s are uniform:

${\displaystyle S(p_{1},\dots ,p_{\Omega })\leq S\left({\frac {1}{\Omega }},\dots ,{\frac {1}{\Omega }}\right)}$

• (Property 2) ${\displaystyle S}$ is not affected by extra states of zero probability:

${\displaystyle S(p_{1},\dots ,p_{\Omega },0)=S(p_{1},\dots ,p_{\Omega })}$
(where on the left side ${\displaystyle S}$ has ${\displaystyle \Omega +1}$ arguments)

• (Property 3) ${\displaystyle S}$ satisfies a composition rule for conditional probabilities.

Let ${\displaystyle A=\lbrace A_{i}\rbrace }$ and ${\displaystyle B=\lbrace B_{k}\rbrace }$ be two sets of events (with ${\displaystyle i=1,\dots ,\Omega }$ and ${\displaystyle k=1,\dots ,M}$), each with probability ${\displaystyle p_{i}=p(A_{i})}$ and ${\displaystyle q_{k}=p(B_{k})}$. If we define:

${\displaystyle r_{ik}=p(A_{i}B_{k})\quad \qquad c_{ik}=p(A_{i}|B_{k})={\frac {p(A_{i}B_{k})}{p(B_{k})}}={\frac {r_{ik}}{q_{k}}}}$
which, by definition, satisfy:
${\displaystyle \sum _{i=1}^{\Omega }c_{ik}=1\quad \qquad \sum _{i=1}^{\Omega }\sum _{k=1}^{M}r_{ik}=1}$
then we can introduce:
${\displaystyle S(AB)=-k_{S}\sum _{i,k}r_{ik}\ln r_{ik}\quad \qquad S(B)=-k_{S}\sum _{k=1}^{M}q_{k}\ln q_{k}}$
${\displaystyle S(A|B_{\ell })=-k_{S}\sum _{i=1}^{\Omega }c_{i\ell }\ln c_{i\ell }}$
which are, respectively, the total entropy associated to the occurrence of events ${\displaystyle A}$ and ${\displaystyle B}$, the entropy associated to the occurrence of events ${\displaystyle B}$ and the entropy associated to the occurrence of events ${\displaystyle A}$ given ${\displaystyle B_{\ell }}$. Therefore, defining:
${\displaystyle \left\langle S(A|B_{\ell })\right\rangle =\sum _{\ell }q_{\ell }S(A|B_{\ell })=-k_{S}\sum _{i,\ell }q_{\ell }c_{i\ell }\ln c_{i\ell }}$
we have that ${\displaystyle S}$ satisfies:
${\displaystyle \left\langle S(A|B_{\ell })\right\rangle =S(AB)-S(B)}$

Proof

• (Property zero) It is immediate from its definition.

• (Property 1) First of all, let us note that the function ${\displaystyle f(x):=-x\ln x}$ is concave, since ${\displaystyle f''(x)=-1/x<0}$ for ${\displaystyle x>0}$ (and in our case ${\displaystyle x}$ represents a probability, so ${\displaystyle x\geq 0}$).

In general, from the definition of concave function we have:

${\displaystyle f(\lambda a+(1-\lambda )b)\geq \lambda f(a)+(1-\lambda )f(b)\qquad \qquad 0\leq \lambda \leq 1}$
From this we can prove by induction that:
${\displaystyle f\left({\frac {1}{\Omega }}\sum _{i=1}^{\Omega }p_{i}\right)\geq \sum _{i=1}^{\Omega }{\frac {1}{\Omega }}f(p_{i})}$
This is surely true for ${\displaystyle \Omega =2}$:
${\displaystyle f\left({\frac {p_{1}+p_{2}}{2}}\right)\geq {\frac {1}{2}}\left(f(p_{1})+f(p_{2})\right)}$
since it follows directly from the definition of concavity with ${\displaystyle p_{1}=a}$, ${\displaystyle p_{2}=b}$ and ${\displaystyle \lambda =1/2}$. If we now suppose that the inequality holds for ${\displaystyle \Omega -1}$ probabilities:
${\displaystyle f\left({\frac {1}{\Omega -1}}\sum _{i=1}^{\Omega -1}p_{i}\right)\geq {\frac {1}{\Omega -1}}\sum _{i=1}^{\Omega -1}f(p_{i})}$
it follows that it holds also for ${\displaystyle \Omega }$. In fact:
${\displaystyle f\left({\frac {1}{\Omega }}\sum _{i=1}^{\Omega }p_{i}\right)=f\left({\frac {\Omega -1}{\Omega }}{\frac {1}{\Omega -1}}\sum _{i=1}^{\Omega -1}p_{i}+{\frac {p_{\Omega }}{\Omega }}\right)}$
and choosing ${\displaystyle (\Omega -1)/\Omega =\lambda }$:
{\displaystyle {\begin{aligned}f\left({\frac {1}{\Omega }}\sum _{i=1}^{\Omega }p_{i}\right)\geq {\frac {\Omega -1}{\Omega }}f\left({\frac {1}{\Omega -1}}\sum _{i=1}^{\Omega -1}p_{i}\right)+{\frac {1}{\Omega }}f(p_{\Omega })\geq \\\geq {\frac {\Omega -1}{\Omega }}{\frac {1}{\Omega -1}}\sum _{i=1}^{\Omega -1}f(p_{i})+{\frac {1}{\Omega }}f(p_{\Omega })\end{aligned}}}
Therefore:
${\displaystyle f\left({\frac {1}{\Omega }}\sum _{i=1}^{\Omega }p_{i}\right)\geq {\frac {1}{\Omega }}\sum _{i=1}^{\Omega }f(p_{i})}$
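As a quick numerical check of this inequality (a small sketch; the vectors below are arbitrary examples), one can verify it directly for ${\displaystyle f(x)=-x\ln x}$:

```python
import math

def f(x):
    """f(x) = -x ln x, continuously extended with f(0) = 0."""
    return 0.0 if x == 0 else -x * math.log(x)

# Check f((1/Omega) * sum(p)) >= (1/Omega) * sum(f(p_i)) for a few arbitrary vectors p
for p in ([0.1, 0.2, 0.3, 0.4], [0.05, 0.05, 0.9], [0.0, 0.5, 0.5]):
    omega = len(p)
    mean = sum(p) / omega
    print(f(mean) >= sum(f(x) for x in p) / omega)   # True in every case
```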

Now, considering the case ${\displaystyle f(x)=-x\ln x}$ we will have that:

{\displaystyle {\begin{aligned}S(p_{1},\dots ,p_{\Omega })=k_{S}\sum _{i=1}^{\Omega }f(p_{i})=k_{S}\Omega {\frac {1}{\Omega }}\sum _{i=1}^{\Omega }f(p_{i})\leq k_{S}\Omega f\left({\frac {1}{\Omega }}\sum _{i=1}^{\Omega }p_{i}\right)=\\=k_{S}\Omega f\left({\frac {1}{\Omega }}\right)=k_{S}\ln \Omega =S\left({\frac {1}{\Omega }},\dots ,{\frac {1}{\Omega }}\right)\end{aligned}}}
where the inequality is strict whenever the ${\displaystyle p_{i}}$ are not all equal, since in this case ${\displaystyle f''(x)<0}$, i.e. ${\displaystyle f}$ is strictly concave.

• (Property 2) This follows immediately from the fact that ${\displaystyle f(x)}$ can be continuously extended to ${\displaystyle x=0}$ by setting ${\displaystyle f(0)=0}$ (as we have previously stated). Therefore, if we add a state with ${\displaystyle p_{\Omega +1}=0}$, it will not contribute to ${\displaystyle S}$ since by this convention ${\displaystyle p_{\Omega +1}\ln p_{\Omega +1}=0}$.

• (Property 3) Since ${\displaystyle r_{ik}=q_{k}c_{ik}}$, we have:

{\displaystyle {\begin{aligned}S(AB)=-k_{S}\sum _{i,k}q_{k}c_{ik}\ln(q_{k}c_{ik})=-k_{S}\sum _{i,k}q_{k}c_{ik}(\ln q_{k}+\ln c_{ik})=\\=-k_{S}\left(\sum _{i,k}q_{k}c_{ik}\ln q_{k}+\sum _{i,k}q_{k}c_{ik}\ln c_{ik}\right)\end{aligned}}}
However:
${\displaystyle \sum _{i,k}q_{k}c_{ik}\ln q_{k}=\sum _{k}q_{k}\ln q_{k}\sum _{i}c_{ik}=\sum _{k}q_{k}\ln q_{k}}$
and therefore:
${\displaystyle S(AB)=-k_{S}\sum _{k}q_{k}\ln q_{k}-k_{S}\sum _{i,k}q_{k}c_{ik}\ln c_{ik}=S(B)+\left\langle S(A|B_{\ell })\right\rangle }$
from which we have immediately:
${\displaystyle \left\langle S(A|B_{\ell })\right\rangle =S(AB)-S(B)}$

Let us note an interesting consequence of the last property of the entropy: if we suppose ${\displaystyle A_{i}}$ and ${\displaystyle B_{k}}$ to be independent then ${\displaystyle S(A|B_{\ell })=S(A)}$ and we get ${\displaystyle S(AB)=S(A)+S(B)}$: we have thus found that entropy is extensive! Therefore, the last property of ${\displaystyle S}$ is a generalization of the extensivity of the entropy for correlated events.
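To make these properties concrete, here is a small numerical sanity check (a minimal sketch in Python; the specific distributions are arbitrary choices, not taken from the text): it verifies that the uniform distribution maximizes ${\displaystyle S}$ (Property 1), that a zero-probability state does not change ${\displaystyle S}$ (Property 2), and that ${\displaystyle \left\langle S(A|B_{\ell })\right\rangle =S(AB)-S(B)}$, which reduces to additivity for independent events (Property 3):

```python
import math

def S(p):
    """Shannon entropy in nats (k_S = 1); states with p_i = 0 contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Property 1: the uniform distribution maximizes S (here Omega = 4, arbitrary example)
uniform = [0.25] * 4
print(S([0.7, 0.1, 0.1, 0.1]) <= S(uniform), S([0.4, 0.3, 0.2, 0.1]) <= S(uniform))

# Property 2: an extra state with zero probability does not change S
print(math.isclose(S([0.6, 0.4, 0.0]), S([0.6, 0.4])))

# Property 3: <S(A|B_l)> = S(AB) - S(B), for an arbitrary 2x2 joint distribution r_ik
r = [[0.10, 0.30],
     [0.25, 0.35]]                               # r[i][k] = p(A_i B_k)
q = [r[0][k] + r[1][k] for k in range(2)]        # marginal of B
S_AB = S([r[i][k] for i in range(2) for k in range(2)])
avg_S_A_given_B = sum(q[k] * S([r[i][k] / q[k] for i in range(2)]) for k in range(2))
print(math.isclose(avg_S_A_given_B, S_AB - S(q)))

# Consequence: for independent events, r_ik = p_i q_k and S(AB) = S(A) + S(B)
p, qB = [0.6, 0.4], [0.3, 0.7]
r_ind = [pi * qk for pi in p for qk in qB]
print(math.isclose(S(r_ind), S(p) + S(qB)))
```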

Now, since these properties are a little bit obscure as we have stated them, let us make a simple example to illustrate their meaning. To make things more comprehensible, we use the same notation as in the theorem. Suppose you have lost your house keys and want to find them; you can measure your progress in finding them by measuring your ignorance with some function[2] ${\displaystyle S}$. Suppose there are ${\displaystyle \Omega }$ possible places ${\displaystyle A_{i}}$ where you might have left the keys, each one with an estimated probability ${\displaystyle p_{i}}$; the three properties of the entropy ${\displaystyle S}$ can thus be interpreted as follows:

• without further information, your ignorance is maximal and the keys could be in any of the possible places, each with the same probability
• if there is no possibility that the keys might be in a particular place ${\displaystyle A_{\Omega }}$ (your shoe, for example), then your ignorance is the same as it would have been if you had not included that place in the list of possible sites
• to improve the search, you are likely to look where you last saw the keys; let us call these ${\displaystyle M}$ places ${\displaystyle B_{k}}$, each with probability ${\displaystyle q_{k}}$ of being the place where you last saw them.

Before thinking about where you last saw the keys, your ignorance about their location is ${\displaystyle S(A)=S(p_{1},\dots ,p_{\Omega })}$ and the one about where you last saw them is ${\displaystyle S(B)=S(q_{1},\dots ,q_{M})}$; therefore you also have the joint ignorance ${\displaystyle S(AB)}$. If the place where you last saw them is ${\displaystyle B_{\ell }}$, then your ignorance about the location of the keys is ${\displaystyle S(A|B_{\ell })}$, and so your combined ignorance has been reduced from ${\displaystyle S(AB)}$ to ${\displaystyle S(A|B_{\ell })}$. You can now measure the usefulness of your guess by determining how much it will reduce your ignorance about where the keys are: the expected ignorance after guessing where you last saw them is given by weighting the ignorance after each guess ${\displaystyle B_{\ell }}$ with the probability of that guess, namely it is ${\displaystyle \left\langle S(A|B_{\ell })\right\rangle }$. The last property of the entropy thus states that after such a guess your expected ignorance decreases exactly by the amount ${\displaystyle S(B)}$.

Now, what makes Shannon's entropy as defined in Entropy as ignorance: information entropy so special is the fact that it is unique (of course up to a proportionality constant), and this follows solely from the properties that we have just shown it satisfies.

Theorem

Shannon's entropy, defined as:

${\displaystyle S_{S}=-k_{S}\sum _{i}p_{i}\ln p_{i}}$
and satisfying the properties shown in the previous theorem, is unique up to a normalization constant. In other words, those properties uniquely identify Shannon's entropy.

The idea of the proof is the following: we first show that those properties force ${\displaystyle S_{S}=S({\vec {p}})}$ to have indeed the expression of Shannon's entropy when[3] ${\displaystyle p_{i}\in \mathbb {Q} ^{+}}$. Then, since ${\displaystyle S}$ is a continuous function (property 0 of the previous theorem), ${\displaystyle S({\vec {p}})}$ has the same expression also for ${\displaystyle p_{i}\in \mathbb {R} ^{+}}$.

Proof

Let us take ${\displaystyle {\vec {q}}\in \mathbb {Q} ^{M}}$, and write its components as ${\displaystyle q_{\ell }=g_{\ell }/g}$ with ${\displaystyle g,g_{\ell }\in \mathbb {N} }$ (where ${\displaystyle g}$ is a common denominator of the ${\displaystyle q_{\ell }}$-s, for example the least common multiple of their denominators; note that ${\displaystyle \sum _{\ell }g_{\ell }=g}$). We suppose ${\displaystyle r_{k\ell }=1/g}$, so that ${\displaystyle c_{k\ell }=r_{k\ell }/q_{\ell }=1/g_{\ell }}$ (with ${\displaystyle k=1,\dots ,g_{\ell }}$, so that there are ${\displaystyle g}$ joint events in total). Defining ${\displaystyle L(g)=S_{S}(1/g,\dots ,1/g)}$, we have:

${\displaystyle S_{S}(AB)=S_{S}(\{r_{k\ell }\})=S_{S}\left(\underbrace {{\frac {1}{g}},\dots ,{\frac {1}{g}}} _{g{\text{ arg.}}}\right)=L(g)}$
${\displaystyle S_{S}(A|B)=\sum _{\ell =1}^{M}q_{\ell }S_{S}(c_{1\ell },\dots ,c_{g_{\ell }\ell })=\sum _{\ell }q_{\ell }L(g_{\ell })}$
From the third property of ${\displaystyle S_{S}}$ we have:
{\displaystyle {\begin{aligned}S_{S}(AB)=S_{S}(A|B)+S_{S}(B)\quad \Rightarrow \quad L(g)=\sum _{\ell }q_{\ell }L(g_{\ell })+S_{S}(q_{1},\dots ,q_{M})\quad \Rightarrow \\\Rightarrow \quad S_{S}(q_{1},\dots ,q_{M})=L(g)-\sum _{\ell }q_{\ell }L(g_{\ell })\end{aligned}}}
Therefore, if we know ${\displaystyle L(g)}$ we can express ${\displaystyle S_{S}}$ (when its arguments are rational, for now). Let us see that from this relation we get the expression of Shannon's entropy if ${\displaystyle L(g)=k\ln g}$, with ${\displaystyle k}$ a generic constant:
{\displaystyle {\begin{aligned}S_{S}(q_{1},\dots ,q_{M})=L(g)-\sum _{\ell }q_{\ell }L(g_{\ell })=k\ln g-k\sum _{\ell }q_{\ell }\ln g_{\ell }=\\=k\sum _{\ell }q_{\ell }(\ln g-\ln g_{\ell })=-k\sum _{\ell }q_{\ell }\ln {\frac {g_{\ell }}{g}}=-k\sum _{\ell }q_{\ell }\ln q_{\ell }\end{aligned}}}
which is exactly Shannon's entropy. We therefore must prove that ${\displaystyle L(g)=k\ln g}$, and in order to do that we will use the first and second properties of ${\displaystyle S_{S}}$. Let us call ${\displaystyle a}$ and ${\displaystyle b}$ two integers such that ${\displaystyle a<b}$; then from the second property we have:
${\displaystyle S_{S}\left(\underbrace {{\frac {1}{a}},\dots ,{\frac {1}{a}}} _{a{\text{ arg.}}}\right)=S_{S}\left(\underbrace {{\frac {1}{a}},\dots ,{\frac {1}{a}}} _{a{\text{ arg.}}},\underbrace {0,\dots ,0} _{b-a{\text{ arg.}}}\right)}$
and from the first:
${\displaystyle S_{S}\left(\underbrace {{\frac {1}{a}},\dots ,{\frac {1}{a}}} _{a{\text{ arg.}}},\underbrace {0,\dots ,0} _{b-a{\text{ arg.}}}\right)\leq S_{S}\left(\underbrace {{\frac {1}{b}},\dots ,{\frac {1}{b}}} _{b{\text{ arg.}}}\right)=L(b)}$
and since for ${\displaystyle a<b}$ the distribution on the left-hand side is not uniform, the inequality is actually strict (as noted in the proof of the first property); so if ${\displaystyle a<b}$ then ${\displaystyle L(a)<L(b)}$. Let now ${\displaystyle C^{(1)},C^{(2)},\dots ,C^{(n)}}$ be ${\displaystyle n}$ classes, each containing ${\displaystyle g}$ independent events with uniform probability; if we call ${\displaystyle A}$ the set of the ${\displaystyle g}$ events in ${\displaystyle C^{(1)}}$ and ${\displaystyle B}$ the set of the ${\displaystyle g^{n-1}}$ joint events of the remaining classes, from the third property of ${\displaystyle S_{S}}$ we have:
${\displaystyle S_{S}(AB)=S_{S}(A)+S_{S}(B)\quad \Rightarrow \quad L(g^{n})=L(g)+L(g^{n-1})}$
we thus have found a recursive formula for ${\displaystyle L(g^{n})}$; applying it ${\displaystyle n}$ times we get:
${\displaystyle L(g^{n})=\underbrace {L(g)+L(g)+\cdots +L(g)} _{n{\text{ times}}}+L(1)=nL(g)+L(1)}$
and we set[4] ${\displaystyle L(1)=0}$. Therefore, ${\displaystyle L(g^{n})=nL(g)}$. Let us now take ${\displaystyle s\in \mathbb {N} }$ and an arbitrary ${\displaystyle n\in \mathbb {N} }$; there will surely be an ${\displaystyle m\in \mathbb {N} }$ for which:
${\displaystyle 2^{m}\leq s^{n}<2^{m+1}}$
so (this can be done because ${\displaystyle L(a)<L(b)}$ if ${\displaystyle a<b}$):
{\displaystyle {\begin{aligned}L(2^{m})\leq L(s^{n})<L(2^{m+1})\quad \Rightarrow \quad mL(2)\leq nL(s)<(m+1)L(2)\quad \Rightarrow \\\Rightarrow \quad {\frac {m}{n}}\leq {\frac {L(s)}{L(2)}}<{\frac {m+1}{n}}\end{aligned}}}
Now, the logarithm ${\displaystyle \ln }$ is a monotonically increasing function, so if ${\displaystyle a<b}$ then ${\displaystyle \ln a<\ln b}$; therefore, applying it to ${\displaystyle 2^{m}\leq s^{n}<2^{m+1}}$ and dividing by ${\displaystyle n\ln 2}$, we find that a similar inequality holds also for the logarithm:
${\displaystyle {\frac {m}{n}}\leq {\frac {\ln s}{\ln 2}}<{\frac {m+1}{n}}}$
This means that ${\displaystyle L(s)/L(2)}$ and ${\displaystyle \ln s/\ln 2}$ both belong to the interval ${\displaystyle [m/n,(m+1)/n]}$, whose width is ${\displaystyle 1/n}$. Thus:
${\displaystyle \left|{\frac {\ln s}{\ln 2}}-{\frac {L(s)}{L(2)}}\right|<{\frac {1}{n}}}$
and taking the limit ${\displaystyle n\to \infty }$ (in fact ${\displaystyle n}$ is arbitrary, so the inequality must hold for every ${\displaystyle n}$) we obtain:
${\displaystyle L(s)={\frac {L(2)}{\ln 2}}\ln s}$
and renaming ${\displaystyle L(2)/\ln 2}$ as ${\displaystyle k}$, we get:
${\displaystyle L(g)=k\ln g}$
which is exactly what we wanted to prove.
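As a closing numerical sketch of two steps of this proof (the distribution ${\displaystyle q=(1/2,1/3,1/6)}$ and the value ${\displaystyle s=3}$ are arbitrary choices): first, with ${\displaystyle L(g)=\ln g}$ the combination ${\displaystyle L(g)-\sum _{\ell }q_{\ell }L(g_{\ell })}$ reproduces ${\displaystyle -\sum _{\ell }q_{\ell }\ln q_{\ell }}$; second, the bounds ${\displaystyle m/n}$ and ${\displaystyle (m+1)/n}$ squeeze ${\displaystyle \log _{2}s}$ as ${\displaystyle n}$ grows, which is the mechanism forcing ${\displaystyle L(s)=(L(2)/\ln 2)\ln s}$:

```python
import math

# Step 1: rational distribution q_l = g_l / g (here g = 6, g_l = 3, 2, 1, i.e.
# q = (1/2, 1/3, 1/6)); check L(g) - sum_l q_l L(g_l) = -sum_l q_l ln(q_l) with L = ln.
g, g_l = 6, [3, 2, 1]
q = [gl / g for gl in g_l]
L = math.log
lhs = L(g) - sum(ql * L(gl) for ql, gl in zip(q, g_l))
rhs = -sum(ql * math.log(ql) for ql in q)
print(lhs, rhs)                                   # both ~1.0114 (nats)

# Step 2: the sandwich 2**m <= s**n < 2**(m+1) squeezes log2(s) between m/n and (m+1)/n.
s = 3
for n in (1, 5, 20, 100, 1000):
    m = (s ** n).bit_length() - 1                 # largest m with 2**m <= s**n
    print(n, m / n, (m + 1) / n, math.log2(s))    # m/n <= log2(s) < (m+1)/n
```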

Therefore, we see that Shannon's entropy is uniquely determined by the properties shown in the first theorem.

1. The ${\displaystyle \Omega }$ we are using here does not necessarily refer to the phase space volume; it is just a notation for the number of possible states.
2. Since it is the measure of your ignorance, this function is exactly the information entropy we have considered.
3. Note: only for this proof the symbol ${\displaystyle \mathbb {Q} }$ will be used in its usual meaning, namely the set of rational numbers.
4. This is the entropy of a system with only one allowed configuration, so it is null; we could also have kept it, but it cancels out in the computations and is therefore irrelevant.