Shannon's entropy

From what we have just seen we can introduce a more general definition of entropy for a generic system that can be found in $\Omega$ different discrete states[1], each with probability $p_i$ (with $i = 1, \dots, \Omega$). This is known as Shannon's entropy, which is defined as:

$$S_S = -k_S \sum_{i=1}^{\Omega} p_i \ln p_i$$

The constant $k_S$ is used instead of Boltzmann's constant $k_B$ because in general the connection to temperature can be irrelevant, and it is defined as $k_S = 1/\ln 2$ so that Shannon's entropy can also be rewritten as:

$$S_S = -\sum_{i=1}^{\Omega} p_i \log_2 p_i$$

The unit of measure of this entropy is the bit. Shannon's entropy is particularly useful because it is the only function that satisfies three important properties, which we now show.
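As a minimal sketch of the definition in bits, the following Python function evaluates $-\sum_i p_i \log_2 p_i$ for a list of probabilities. The function name `shannon_entropy_bits` and the example distributions are illustrative choices, not part of the original notes; terms with $p_i = 0$ are skipped on the assumption that $p \ln p \to 0$ as $p \to 0$.

```python
import math

def shannon_entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Zero-probability states are skipped: p * log(p) -> 0 as p -> 0.
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Illustrative (hypothetical) distributions.
fair_coin = shannon_entropy_bits([0.5, 0.5])     # two equally likely outcomes
biased_coin = shannon_entropy_bits([0.9, 0.1])   # less uncertain than a fair coin
certain = shannon_entropy_bits([1.0, 0.0])       # no uncertainty at all

print(fair_coin, biased_coin, certain)
```

A fair coin carries exactly one bit, a biased coin less than one bit, and a certain outcome carries none.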


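Let $p_1, \dots, p_\Omega$ be a set of stochastic variables (they are probabilities, therefore such that $p_i \geq 0$ and $\sum_{i=1}^{\Omega} p_i = 1$). Then Shannon's entropy $S_S$ (which is a function of this set of variables) as defined above satisfies the following properties:

  • (Property zero) It is a continuous function

  • (Property 1) $S_S$ is maximized when all the $p_i$-s are uniform:

$$S_S(p_1, \dots, p_\Omega) \leq S_S\left(\frac{1}{\Omega}, \dots, \frac{1}{\Omega}\right)$$

  • (Property 2) $S_S$ is not affected by extra states of zero probability:

$$S_S(p_1, \dots, p_\Omega, 0) = S_S(p_1, \dots, p_\Omega)$$

(where $S_S$ on the left side has $\Omega + 1$ arguments)

  • (Property 3) $S_S$ changes for conditional probabilities.

Let $A = \{A_k\}$ and $B = \{B_\ell\}$ be two sets of events (with $k = 1, \dots, \Omega$ and $\ell = 1, \dots, M$), each with probability $p_k = P(A_k)$ and $q_\ell = P(B_\ell)$. If we define:

$$r_{k\ell} = P(A_k B_\ell) \qquad c_{k\ell} = P(A_k | B_\ell) = \frac{r_{k\ell}}{q_\ell}$$

which, by definition, satisfy:

$$\sum_{k,\ell} r_{k\ell} = 1 \qquad \sum_{k} c_{k\ell} = 1$$

then we can introduce:

$$S_S(AB) = -k_S \sum_{k,\ell} r_{k\ell} \ln r_{k\ell} \qquad S_S(B) = -k_S \sum_{\ell} q_\ell \ln q_\ell \qquad S_S(A|B_\ell) = -k_S \sum_{k} c_{k\ell} \ln c_{k\ell}$$

which are, respectively, the total entropy associated to the occurrence of events $A$ and $B$, the entropy associated to the occurrence of events $B$, and the entropy associated to the occurrence of events $A$ given $B_\ell$. Therefore, defining:

$$\langle S_S(A|B_\ell) \rangle_B = \sum_{\ell} q_\ell \, S_S(A|B_\ell)$$

we have that $S_S$ satisfies:

$$\langle S_S(A|B_\ell) \rangle_B = S_S(AB) - S_S(B)$$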
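Properties 1 and 2 can be spot-checked numerically. The sketch below draws random distributions over $\Omega = 5$ states (an arbitrary choice) and verifies that none exceeds the entropy of the uniform distribution, and that appending a zero-probability state leaves the entropy unchanged. The helper names and the random seed are assumptions made for illustration; a numerical check is of course not a proof.

```python
import math
import random

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def random_distribution(n, rng):
    """Return n non-negative numbers that sum to 1 (normalized random weights)."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)   # fixed seed so the run is reproducible
omega = 5                # hypothetical number of states
uniform_entropy = entropy_bits([1 / omega] * omega)

samples = [random_distribution(omega, rng) for _ in range(1000)]

# Property 1: no sampled distribution beats the uniform one.
property_1_holds = all(entropy_bits(p) <= uniform_entropy + 1e-12 for p in samples)

# Property 2: an extra state of zero probability changes nothing.
property_2_holds = all(
    math.isclose(entropy_bits(p + [0.0]), entropy_bits(p)) for p in samples
)

print(property_1_holds, property_2_holds)
```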



  • (Property zero) It is immediate from its definition.

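  • (Property 1) First of all, let us note that the function $f(p) = -p \ln p$ is concave since $f''(p) = -1/p < 0$ (and in our case $p \geq 0$ since it represents a probability).

In general, from the definition of concave function we have, for any $\lambda \in [0, 1]$:

$$f(\lambda a + (1 - \lambda) b) \geq \lambda f(a) + (1 - \lambda) f(b)$$

From this we can prove by induction that:

$$f\left(\frac{1}{\Omega} \sum_{k=1}^{\Omega} p_k\right) \geq \frac{1}{\Omega} \sum_{k=1}^{\Omega} f(p_k)$$

This is surely true for $\Omega = 2$:

$$f\left(\frac{p_1 + p_2}{2}\right) \geq \frac{f(p_1) + f(p_2)}{2}$$

since it follows directly from the definition of concavity with $\lambda = 1/2$, $a = p_1$ and $b = p_2$. If we now suppose that the inequality holds for $\Omega - 1$ probabilities:

$$f\left(\frac{1}{\Omega - 1} \sum_{k=1}^{\Omega - 1} p_k\right) \geq \frac{1}{\Omega - 1} \sum_{k=1}^{\Omega - 1} f(p_k)$$

it follows that it holds also for $\Omega$. In fact:

$$f\left(\frac{1}{\Omega} \sum_{k=1}^{\Omega} p_k\right) = f\left(\frac{\Omega - 1}{\Omega} \cdot \frac{1}{\Omega - 1} \sum_{k=1}^{\Omega - 1} p_k + \frac{1}{\Omega} p_\Omega\right)$$

and choosing $\lambda = (\Omega - 1)/\Omega$:

$$f\left(\frac{1}{\Omega} \sum_{k=1}^{\Omega} p_k\right) \geq \frac{\Omega - 1}{\Omega} f\left(\frac{1}{\Omega - 1} \sum_{k=1}^{\Omega - 1} p_k\right) + \frac{1}{\Omega} f(p_\Omega) \geq \frac{1}{\Omega} \sum_{k=1}^{\Omega - 1} f(p_k) + \frac{1}{\Omega} f(p_\Omega) = \frac{1}{\Omega} \sum_{k=1}^{\Omega} f(p_k)$$

Now, considering the case of $S_S$ we will have that:

$$S_S(p_1, \dots, p_\Omega) = k_S \sum_{k=1}^{\Omega} f(p_k) = k_S \Omega \cdot \frac{1}{\Omega} \sum_{k=1}^{\Omega} f(p_k) \leq k_S \Omega f\left(\frac{1}{\Omega} \sum_{k=1}^{\Omega} p_k\right) = k_S \Omega f\left(\frac{1}{\Omega}\right) = S_S\left(\frac{1}{\Omega}, \dots, \frac{1}{\Omega}\right)$$

where the inequality is strict unless all the $p_k$ are equal to $1/\Omega$, because $f''(p) < 0$ makes $f$ strictly concave.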

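  • (Property 2) This follows immediately from the fact that $f(p) = -p \ln p$ can be continuously extended in $p = 0$, so that $f(0) = 0$ (as we have previously stated). Therefore if we add $p_{\Omega+1} = 0$ to the arguments of $S_S$, it will not contribute to $S_S$ since by definition $p_{\Omega+1} \ln p_{\Omega+1} = 0$.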

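  • (Property 3) Since $r_{k\ell} = q_\ell c_{k\ell}$, we have:

$$S_S(AB) = -k_S \sum_{k,\ell} q_\ell c_{k\ell} \ln(q_\ell c_{k\ell}) = -k_S \left(\sum_{k,\ell} q_\ell c_{k\ell} \ln q_\ell + \sum_{k,\ell} q_\ell c_{k\ell} \ln c_{k\ell}\right)$$

and therefore, using $\sum_k c_{k\ell} = 1$:

$$S_S(AB) = -k_S \sum_{\ell} q_\ell \ln q_\ell + \sum_{\ell} q_\ell \left(-k_S \sum_{k} c_{k\ell} \ln c_{k\ell}\right) = S_S(B) + \langle S_S(A|B_\ell) \rangle_B$$

from which we have immediately:

$$\langle S_S(A|B_\ell) \rangle_B = S_S(AB) - S_S(B)$$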
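The concavity inequality used in the proof of Property 1, namely that $f$ evaluated at the mean is at least the mean of $f$ for $f(p) = -p \ln p$, can be spot-checked numerically. This is only a sketch under stated assumptions: the sample sizes, the seed and the tolerance are arbitrary choices, and a numerical check does not replace the induction argument.

```python
import math
import random

def f(p):
    """f(p) = -p ln p, continuously extended with f(0) = 0."""
    return -p * math.log(p) if p > 0 else 0.0

rng = random.Random(1)   # fixed seed for reproducibility
jensen_holds = True

for _ in range(1000):
    n = rng.randint(2, 10)                    # number of points (Omega)
    xs = [rng.random() for _ in range(n)]     # arbitrary points in [0, 1)
    f_of_mean = f(sum(xs) / n)
    mean_of_f = sum(f(x) for x in xs) / n
    # Concavity: f(mean) >= mean(f), up to floating-point rounding.
    if f_of_mean < mean_of_f - 1e-12:
        jensen_holds = False

print(jensen_holds)
```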


Let us note an interesting consequence of the last property of the entropy: if we suppose $A$ and $B$ to be independent then $S_S(A|B_\ell) = S_S(A)$ and we get $S_S(AB) = S_S(A) + S_S(B)$: we have thus found that entropy is extensive! Therefore, the last property of $S_S$ is a generalization of the extensivity of the entropy for correlated events.
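The additivity for independent events is easy to verify on a small example. In the sketch below the two marginal distributions are hypothetical numbers chosen for illustration; the joint probabilities are built as products $r_{k\ell} = p_k q_\ell$, which is exactly the independence assumption.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

p = [0.5, 0.3, 0.2]   # hypothetical P(A_k)
q = [0.6, 0.4]        # hypothetical P(B_l)

# Independence: the joint probability factorizes, r_kl = p_k * q_l.
r = [pk * ql for pk in p for ql in q]

s_a = entropy_bits(p)
s_b = entropy_bits(q)
s_ab = entropy_bits(r)

print(s_ab, s_a + s_b)   # the two values agree up to rounding
```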

Now, since these properties are a little bit obscure as we have stated them, let us give a simple example to illustrate their meaning. To make things more comprehensible, we use the same notation as in the proof. Suppose you have lost your house keys, and want to find them; you can measure your progress in finding them by measuring your ignorance with some function $S_S$[2]. Suppose there are $\Omega$ possible places $A_k$ where you might have left the keys, each one with an estimated probability $p_k$; the three properties of the entropy can thus be interpreted as follows:

  • without further information, your ignorance is maximal and the keys could be in any of the possible places, each with the same probability $1/\Omega$
  • if there is no possibility that the keys might be in a particular place (your shoe, for example), then your ignorance is no larger than what it would have been if you had not included that place in the list of possible sites
  • to improve the search, you are likely to look where you have last seen the keys; let us call these places $B_\ell$, each with probability $q_\ell$ that the keys are indeed there.

Before thinking about where you last saw the keys, your ignorance about their location is $S_S(A)$ and the one about where you last saw them is $S_S(B)$; therefore you also have the joint ignorance $S_S(AB)$. If the location where you last saw them is $B_\ell$ then your ignorance about the location of the keys is $S_S(A|B_\ell)$, and so your combined ignorance has reduced from $S_S(AB)$ to $S_S(A|B_\ell)$. You can now measure the usefulness of your guess by determining how much it will reduce your ignorance about where the keys are: the expected ignorance after you have guessed where the keys might have been is given by weighting the ignorance after each guess by the probability of that guess, namely it is $\langle S_S(A|B_\ell) \rangle_B = \sum_\ell q_\ell S_S(A|B_\ell)$. The last property of the entropy thus states that after a guess your expected ignorance decreases exactly by the amount $S_S(B)$.
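The lost-keys reasoning can be made concrete with numbers. In the following sketch every place name and every probability is invented for illustration: the table gives the joint probability that the keys are in a given place and were last seen at a given location. The code computes the joint ignorance, the ignorance about the last-seen location, and the expected ignorance after the guess, and shows that the reduction equals $S_S(B)$.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint probabilities: joint[B_l][A_k] = P(keys in A_k and last seen at B_l).
joint = {
    "home":   {"desk": 0.30, "coat": 0.15, "car": 0.05},
    "office": {"desk": 0.05, "coat": 0.10, "car": 0.35},
}

# q_l = P(B_l): marginal probability of each "last seen" location.
q = {seen: sum(row.values()) for seen, row in joint.items()}

s_ab = entropy_bits([r for row in joint.values() for r in row.values()])
s_b = entropy_bits(q.values())

# Expected ignorance after the guess: sum_l q_l * S(A | B_l), with c_kl = r_kl / q_l.
expected_after_guess = sum(
    q[seen] * entropy_bits([r / q[seen] for r in row.values()])
    for seen, row in joint.items()
)

reduction = s_ab - expected_after_guess   # Property 3 says this equals S(B)
print(f"S(AB) = {s_ab:.4f}  S(B) = {s_b:.4f}  <S(A|B)> = {expected_after_guess:.4f}")
```

With these particular numbers the two last-seen locations are equally likely, so the guess is worth one bit.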

Now, what makes Shannon's entropy, as defined in the section "Entropy as ignorance: information entropy", so special is the fact that it is unique (up to a proportionality constant), and this follows only from the properties that we have just shown it satisfies.


Shannon's entropy, defined as:

$$S_S = -k_S \sum_{i=1}^{\Omega} p_i \ln p_i$$

and satisfying the properties shown in the theorem above, is unique up to a normalization constant. In other words, the properties shown in the theorem above uniquely identify Shannon's entropy.


The idea of the proof is the following: we prove that from those properties it follows that $S_S$ has indeed the expression of Shannon's entropy when $p_i \in \mathbb{Q}$[3]. Then, since $S_S$ is a continuous function (property zero of the theorem above), $S_S$ has the same expression also for $p_i \in \mathbb{R}$.


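Let us take $\vec{q} \in \mathbb{Q}^M$, and we write its components as $q_\ell = g_\ell / g$ with $g_\ell, g \in \mathbb{N}$ (and $g$ the least common multiple of the denominators of the $q_\ell$-s). We suppose $r_{k\ell} = 1/g$, so that $c_{k\ell} = r_{k\ell}/q_\ell = 1/g_\ell$ (and $k = 1, \dots, g_\ell$). Defining $L(g) = S_S(1/g, \dots, 1/g)$, we have:

$$S_S(AB) = S_S(r_{11}, \dots, r_{g_M M}) = L(g) \qquad S_S(A|B_\ell) = S_S(c_{1\ell}, \dots, c_{g_\ell \ell}) = L(g_\ell)$$

From the third property of $S_S$ we have:

$$S_S(AB) = \langle S_S(A|B_\ell) \rangle_B + S_S(B) \quad \Rightarrow \quad S_S(q_1, \dots, q_M) = L(g) - \sum_{\ell=1}^{M} q_\ell L(g_\ell)$$

Therefore, if we know $L(g)$ we can express $S_S$ (when its arguments are rational, for now). Let us see that from this relation we get the expression of Shannon's entropy if $L(g) = k \ln g$, with $k$ a generic constant (we use $\sum_\ell q_\ell = 1$ in the second step):

$$S_S(q_1, \dots, q_M) = k \ln g - k \sum_{\ell} q_\ell \ln g_\ell = k \sum_{\ell} q_\ell (\ln g - \ln g_\ell) = -k \sum_{\ell} q_\ell \ln \frac{g_\ell}{g} = -k \sum_{\ell} q_\ell \ln q_\ell$$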
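which is exactly Shannon's entropy. We therefore must prove that $L(g) = S_S(1/g, \dots, 1/g) = k \ln g$, and in order to do that we will use the first and second properties of $S_S$. Let us call $a$ and $b$ two integers such that $a < b$; then from the second property we have:

$$L(a) = S_S\left(\frac{1}{a}, \dots, \frac{1}{a}\right) = S_S\Big(\underbrace{\frac{1}{a}, \dots, \frac{1}{a}}_{a}, \underbrace{0, \dots, 0}_{b - a}\Big)$$

and from the first:

$$S_S\Big(\underbrace{\frac{1}{a}, \dots, \frac{1}{a}}_{a}, \underbrace{0, \dots, 0}_{b - a}\Big) < S_S\left(\frac{1}{b}, \dots, \frac{1}{b}\right) = L(b)$$

so if $a < b$ then $L(a) < L(b)$. Let us now consider $n$ classes, each containing $g$ independent events with uniform probability, so that there are $g^n$ equally probable joint outcomes; if we call $A$ the set of events in the first class and $B$ all the remaining ones, from the third property of $S_S$ we have:

$$L(g^n) = S_S(AB) = \langle S_S(A|B_\ell) \rangle_B + S_S(B) = L(g) + L(g^{n-1})$$

we thus have found a recursive formula for $L(g^n)$; applying it $n$ times we get:

$$L(g^n) = n L(g) + L(1)$$

and we set $L(1) = 0$[4]. Therefore, $L(g^n) = n L(g)$. Let us now take any two integers $a, b > 1$ and $n \in \mathbb{N}$; there will surely be an integer $m \geq 0$ for which:

$$a^m \leq b^n < a^{m+1}$$

so, using the monotonicity of $L$ and $L(g^n) = n L(g)$, we have $m L(a) \leq n L(b) < (m+1) L(a)$, and dividing by $n L(a)$ (this can be done because $L(a) > L(1) = 0$ if $a > 1$):

$$\frac{m}{n} \leq \frac{L(b)}{L(a)} < \frac{m+1}{n}$$

Now, the logarithm is a monotonically increasing function so if $a^m \leq b^n < a^{m+1}$ then $m \ln a \leq n \ln b < (m+1) \ln a$; therefore we find that a similar inequality holds also for the logarithm:

$$\frac{m}{n} \leq \frac{\ln b}{\ln a} < \frac{m+1}{n}$$

This means that $L(b)/L(a)$ and $\ln b / \ln a$ both belong to the interval $[m/n, (m+1)/n)$, whose width is $1/n$. Thus:

$$\left| \frac{L(b)}{L(a)} - \frac{\ln b}{\ln a} \right| < \frac{1}{n}$$

and taking the limit $n \to \infty$ (in fact $n$ is arbitrary, so the inequality must hold for all $n$-s) we obtain:

$$\frac{L(b)}{L(a)} = \frac{\ln b}{\ln a}$$

that is, the ratio $L(g)/\ln g$ takes the same value for every integer $g > 1$. Renaming $L(a)/\ln a$ as $k$ and $b$ as $g$, we get:

$$L(g) = k \ln g$$

which is exactly what we wanted to prove.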
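The two key steps of the proof can be illustrated numerically for a rational distribution. The sketch below uses a hypothetical example, $\vec{q} = (1/2, 1/3, 1/6)$, and computes $L(g)$ directly from the definition of the entropy of $g$ equally likely states; it then checks that $L(g) - \sum_\ell q_\ell L(g_\ell)$ reproduces the entropy of $\vec{q}$ and that $L(g^n) = n L(g)$. Function names are illustrative, and entropies are measured in bits, i.e. with $k = 1/\ln 2$.

```python
import math
from fractions import Fraction
from functools import reduce

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def L(g):
    """Entropy in bits of g equally likely states, evaluated from the definition."""
    return entropy_bits([1 / g] * g)

# Hypothetical rational distribution q_l = g_l / g.
q = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]

# g: least common multiple of the denominators; g_l = q_l * g are then integers.
g = reduce(lambda x, y: x * y // math.gcd(x, y), (x.denominator for x in q))
g_l = [int(x * g) for x in q]

# Third property applied to g equally likely events grouped into classes of size g_l.
reconstructed = L(g) - sum(float(x) * L(n) for x, n in zip(q, g_l))
direct = entropy_bits([float(x) for x in q])

# Scaling relation L(g^n) = n L(g), checked here for g = 3.
scaling_ok = all(math.isclose(L(3 ** n), n * L(3)) for n in range(1, 6))

print(g, g_l, reconstructed, direct, scaling_ok)
```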

Therefore, we see that Shannon's entropy is uniquely determined by the properties shown in the theorem above.

  1. The $\Omega$ we are using here does not necessarily refer to the phase space volume; it is just a notation for the number of possible states.
  2. Since it is the measure of your ignorance, this function is exactly the information entropy we have considered.
  3. Note: only for this proof the symbol $\mathbb{Q}$ will be used in its usual meaning, namely the set of rational numbers.
  4. $L(1)$ is the entropy of a system with only one allowed configuration, so it is null; we could also have kept it, but it cancels out in the computations so it is irrelevant anyway.