# Shannon's entropy

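From what we have just seen we can introduce a more general definition of entropy for a generic system that can be found in $\Omega$ different discrete states^{[1]}, each with probability $p_i$ (with $i = 1, \dots, \Omega$). This is known as *Shannon's entropy*, which is defined as:

$$S_S = -k_S \sum_{i=1}^{\Omega} p_i \ln p_i$$

where $k_S$ is a positive constant that fixes the units: with $k_S = 1/\ln 2$ (namely, using base-2 logarithms) the entropy is measured in *bit*. Shannon's entropy is particularly useful because it is the only function that satisfies three important properties, which we now show.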
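As a minimal sketch of the definition, assuming the choice $k_S = 1/\ln 2$ so that the result comes out in bits, the entropy of a discrete distribution could be computed as follows. The function name `shannon_entropy` and the example distributions are illustrative choices, not part of the original text; states of zero probability are skipped, following the convention $0 \ln 0 = 0$.

```python
import math

def shannon_entropy(probs, k_s=1.0 / math.log(2)):
    """Return -k_s * sum(p * ln p); the default k_s gives the result in bits."""
    if any(p < 0 for p in probs) or not math.isclose(sum(probs), 1.0):
        raise ValueError("probs must be non-negative and sum to 1")
    # States with p == 0 contribute nothing (convention 0 * ln 0 = 0).
    return -k_s * sum(p * math.log(p) for p in probs if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])    # maximal ignorance for two states
biased_coin = shannon_entropy([0.9, 0.1])  # we already know something
certain = shannon_entropy([1.0, 0.0])      # no ignorance left
```

A fair coin carries one bit of ignorance, a biased coin less than one bit, and a certain outcome none.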

**Theorem**

Let $p_1, \dots, p_\Omega$ be a set of stochastic variables (they are probabilities, therefore such that $p_i \ge 0$ and $\sum_i p_i = 1$). Then Shannon's entropy $S_S$ (which is a function of this set of variables) as defined above satisfies the following properties:

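- (*Property zero*) $S_S$ is a continuous function of the $p_i$-s

- (*Property 1*) $S_S$ is maximized when all the $p_i$-s are uniform:

  $$S_S(p_1, \dots, p_\Omega) \le S_S\left(\frac{1}{\Omega}, \dots, \frac{1}{\Omega}\right)$$

- (*Property 2*) $S_S$ is not affected by extra states of zero probability:

  $$S_S(p_1, \dots, p_\Omega, 0) = S_S(p_1, \dots, p_\Omega)$$

- (*Property 3*) $S_S$ changes for conditional probabilities.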

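Let $A = \{A_k\}$ and $B = \{B_\ell\}$ be two sets of events (with $k = 1, \dots, \Omega$ and $\ell = 1, \dots, M$), occurring with probabilities $p_k$ and $q_\ell$ respectively. We define:

$$r_{k\ell} = P(A_k \text{ and } B_\ell), \qquad c_{k\ell} = P(A_k \mid B_\ell) = \frac{r_{k\ell}}{q_\ell}$$

namely $r_{k\ell}$ is the probability that $A_k$ *and* $B_\ell$ both occur, while $c_{k\ell}$ is the probability that $A_k$ occurs *given* that $B_\ell$ has occurred. Therefore, defining:

$$S_S(AB) = -k_S \sum_{k,\ell} r_{k\ell} \ln r_{k\ell}, \qquad S_S(B) = -k_S \sum_{\ell} q_\ell \ln q_\ell, \qquad S_S(A|B_\ell) = -k_S \sum_{k} c_{k\ell} \ln c_{k\ell}$$

(respectively the entropy associated to the joint occurrence of events $A$ and $B$, the entropy associated to the occurrence of events $B$, and the entropy associated to the occurrence of events $A$ given $B_\ell$), the property states that:

$$\langle S_S(A|B_\ell) \rangle_B = \sum_\ell q_\ell S_S(A|B_\ell) = S_S(AB) - S_S(B)$$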

*Proof*

- (*Property zero*) It is immediate from its definition.

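- (*Property 1*) First of all, let us note that the function $f(p) = -p \ln p$ is concave since $f''(p) = -1/p < 0$ for $p > 0$ (and in our case $p \ge 0$ since it represents a probability).

In general, from the definition of concave function we have:

$$f\left(\frac{1}{\Omega} \sum_{k=1}^{\Omega} p_k\right) \ge \frac{1}{\Omega} \sum_{k=1}^{\Omega} f(p_k)$$

Now, considering the case $\sum_k p_k = 1$ we will have that:

$$S_S(p_1, \dots, p_\Omega) = k_S \sum_{k=1}^{\Omega} f(p_k) \le k_S \, \Omega \, f\left(\frac{1}{\Omega}\right) = k_S \ln \Omega = S_S\left(\frac{1}{\Omega}, \dots, \frac{1}{\Omega}\right)$$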

- (*Property 2*) This follows immediately from the fact that $f(p) = -p \ln p$ can be continuously extended in $p = 0$, so that $f(0) = 0$ (as we have previously stated). Therefore if we add $p_{\Omega + 1} = 0$ to $S_S$, it will not contribute to $S_S$ since by definition $0 \ln 0 = 0$.

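- (*Property 3*) Since $r_{k\ell} = q_\ell c_{k\ell}$ and $\sum_k c_{k\ell} = 1$, we have:

$$S_S(AB) = -k_S \sum_{k,\ell} q_\ell c_{k\ell} \ln (q_\ell c_{k\ell}) = -k_S \sum_\ell q_\ell \ln q_\ell \sum_k c_{k\ell} - k_S \sum_\ell q_\ell \sum_k c_{k\ell} \ln c_{k\ell} = S_S(B) + \langle S_S(A|B_\ell) \rangle_B$$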

Let us note an interesting consequence of the last property of the entropy: if we suppose $A$ and $B$ to be independent then $S_S(A|B_\ell) = S_S(A)$ for every $\ell$ and we get $S_S(AB) = S_S(A) + S_S(B)$: we have thus found that entropy is extensive! Therefore, the last property of $S_S$ is a generalization of the extensivity of the entropy to correlated events.
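The first two properties and the extensivity for independent events can be checked numerically. The following is only a sketch: the probabilities `p` and `q` are made-up values, `entropy` is a hypothetical helper, and $k_S = 1$ (natural logarithm) is assumed.

```python
import math

def entropy(probs):
    """Shannon entropy with k_S = 1 (natural logarithm), skipping p == 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

p = [0.5, 0.3, 0.2]  # hypothetical probabilities of the events A_k
q = [0.6, 0.4]       # hypothetical probabilities of the events B_l

# Property 1: the uniform distribution over the same states has larger entropy.
uniform_is_max = entropy(p) <= entropy([1 / len(p)] * len(p))

# Property 2: an extra state of zero probability changes nothing.
zero_state_ignored = entropy(p + [0.0]) == entropy(p)

# Independent events: joint probabilities are products p_k * q_l,
# and the joint entropy is the sum of the two entropies.
joint = [pk * ql for pk in p for ql in q]
extensivity_gap = entropy(joint) - (entropy(p) + entropy(q))
```

Here `extensivity_gap` vanishes up to floating-point rounding, as expected for independent events.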

Now, since these properties are a little bit obscure as we have stated them, let us give a simple example to illustrate their meaning. To make things more comprehensible, we use the same notation as in the proof.
Suppose you have lost your house keys, and want to find them; you can measure your progress in finding them by measuring your ignorance with some function $S_I$^{[2]}. Suppose there are $\Omega$ possible places $A_k$ where you might have left the keys, each one with an estimated probability $p_k$; the three properties of the entropy can thus be interpreted as follows:

- without further information, your ignorance is maximal and the keys could be in any of the $\Omega$ possible places, each with the same probability $1/\Omega$
- if there is no possibility that the keys might be in a particular place (your shoe, for example), then your ignorance is no larger than what it would have been if you had not included that place in the list of possible sites
- to improve the search, you are likely to look where you have last seen the keys; let us call these places $B_\ell$, each with probability $q_\ell$ that the keys are indeed there.

Before thinking about where you last saw the keys, your ignorance about their location is $S_I(A)$ and the one about where you last saw them is $S_I(B)$; therefore you also have the joint ignorance $S_I(AB)$. If the location where you last saw them is $B_\ell$ then your ignorance about the location of the keys is $S_I(A|B_\ell)$, and so your combined ignorance has reduced from $S_I(AB)$ to $S_I(A|B_\ell)$. You can now measure the usefulness of your guess by determining how much it will reduce your ignorance about where the keys are: the expected ignorance after you have guessed where the keys might have been is given by weighting the ignorance after each guess by the probability of that guess, namely it is $\langle S_I(A|B_\ell) \rangle_B = \sum_\ell q_\ell S_I(A|B_\ell)$. The last property of the entropy thus states that after a guess your expected ignorance decreases exactly by the amount $S_I(B)$.
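The keys example can be made concrete with a small numerical sketch. The places, the last-seen locations and all the probabilities below are invented for illustration, `ignorance` is a hypothetical helper, and the entropy is measured in bits (assuming $k_S = 1/\ln 2$).

```python
import math

def ignorance(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint probabilities: joint[seen][place] is the probability that
# the keys were last seen at `seen` AND are now in `place`.
joint = {
    "hallway": {"coat": 0.30, "door": 0.15, "desk": 0.05, "car": 0.00},
    "garage":  {"coat": 0.05, "door": 0.05, "desk": 0.10, "car": 0.30},
}

# Probability of each last-seen location, and the ignorance about it.
q = {seen: sum(row.values()) for seen, row in joint.items()}
s_b = ignorance(q.values())

# Joint ignorance about (place, last-seen location).
s_ab = ignorance(x for row in joint.values() for x in row.values())

# Ignorance about the place once the last-seen location is known.
s_given = {
    seen: ignorance(x / q[seen] for x in row.values())
    for seen, row in joint.items()
}

# Expected ignorance after the guess, weighted by the probability of each guess.
expected_after = sum(q[seen] * s_given[seen] for seen in joint)
reduction = s_ab - expected_after
```

With these numbers the two last-seen locations are equally likely, so the guess is worth one bit: `reduction` coincides with `s_b` up to rounding.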

Now, what makes Shannon's entropy as defined in *Entropy as ignorance: information entropy* so special is the fact that it is unique (of course up to a proportionality constant), and this follows only from the properties that we have just shown it satisfies.

**Theorem**

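Shannon's entropy, defined as:

$$S_S = -k_S \sum_{i=1}^{\Omega} p_i \ln p_i$$

is the only function (up to the constant $k_S$) that satisfies the properties of the previous theorem.

The idea of the proof is the following: we prove that from those properties we have that a function $S_I$ satisfying them has indeed the expression of Shannon's entropy when $\vec{p} \in \mathbb{Q}^\Omega$^{[3]}. Then, since $S_I$ is a continuous function (property zero of the previous theorem) we have that $S_I$ has the same expression also for $\vec{p} \in \mathbb{R}^\Omega$.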

*Proof*

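Let $S_I$ be a function satisfying the properties of the previous theorem. Let us take $\vec{q} \in \mathbb{Q}^M$, and we write its components as $q_\ell = g_\ell / g$ with $g_\ell, g \in \mathbb{N}$ (and $g$ the least common denominator of the $q_\ell$-s). We suppose that $A$ consists of $g$ equally likely events, each compatible with exactly one $B_\ell$ ($g_\ell$ of them for each $\ell$): $r_{k\ell} = 1/g$ for these pairs and $r_{k\ell} = 0$ otherwise, so that $c_{k\ell} = r_{k\ell} / q_\ell = 1/g_\ell$ (and $\sum_k r_{k\ell} = g_\ell / g = q_\ell$). Defining $L(g) = S_I(1/g, \dots, 1/g)$, we have (dropping the states of zero probability thanks to property 2):

$$S_I(AB) = L(g), \qquad S_I(A|B_\ell) = L(g_\ell)$$

and property 3 gives:

$$S_I(B) = S_I(AB) - \langle S_I(A|B_\ell) \rangle_B = L(g) - \sum_\ell q_\ell L(g_\ell)$$

so that $S_I$ is known once $L$ is known. Now, if $A$ and $B$ are independent and consist of $a$ and $b$ equally likely events respectively, the same property gives $L(ab) = L(a) + L(b)$, and in particular $L(1) = 0$^{[4]}. Therefore, $L(a^n) = n L(a)$. Moreover, properties 1 and 2 imply $L(a) = S_I(1/a, \dots, 1/a, 0) \le L(a + 1)$, so $L$ is non-decreasing. Let us now take $a, b, n \in \mathbb{N}$ with $a, b \ge 2$; there will surely be an $m \in \mathbb{N}$ for which:

$$a^m \le b^n < a^{m+1}$$

Applying $L$ and the logarithm to this chain of inequalities we get:

$$m L(a) \le n L(b) \le (m+1) L(a), \qquad m \ln a \le n \ln b < (m+1) \ln a$$

so that, if $L(a) > 0$ (otherwise $L$ vanishes identically, which is the trivial case $k_S = 0$), both $L(b)/L(a)$ and $\ln b / \ln a$ lie between $m/n$ and $(m+1)/n$, and they differ at most by $1/n$. Since $n$ is arbitrary, $L(b) / \ln b = L(a) / \ln a$ is a constant, which we call $k_S$: namely $L(g) = k_S \ln g$. Substituting in the expression of $S_I(B)$:

$$S_I(B) = k_S \ln g - k_S \sum_\ell q_\ell \ln g_\ell = -k_S \sum_\ell q_\ell \ln \frac{g_\ell}{g} = -k_S \sum_\ell q_\ell \ln q_\ell$$

Therefore, we see that Shannon's entropy is uniquely determined by the properties shown in the previous theorem.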
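The two ingredients of this proof can be illustrated numerically. The sketch below assumes that the entropy of $g$ equally likely states is $L(g) = k_S \ln g$ with $k_S = 1$, that the rational probabilities are $q_\ell = g_\ell / g$ with $g$ their least common denominator, and that the integer $m$ is the one satisfying $a^m \le b^n < a^{m+1}$; the function names and the sample values are illustrative choices.

```python
import math
from fractions import Fraction

def L(g):
    """Entropy of g equally likely states with k_S = 1, i.e. ln g."""
    return math.log(g)

# Hypothetical rational probabilities q_l = g_l / g.
q = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]

# g is the least common denominator, so that every g_l = q_l * g is an integer.
g = 1
for x in q:
    g = g * x.denominator // math.gcd(g, x.denominator)
g_l = [int(x * g) for x in q]

# Entropy obtained by counting equally likely states versus the Shannon formula.
from_counting = L(g) - sum(float(x) * L(n) for x, n in zip(q, g_l))
from_formula = -sum(float(x) * math.log(float(x)) for x in q)

def sandwich(a, b, n):
    """Return the integer m with a**m <= b**n < a**(m + 1), using exact integers."""
    target = b ** n
    m, power = 0, 1
    while power * a <= target:
        power *= a
        m += 1
    return m

# m / n approaches ln b / ln a as n grows, which pins down L up to a constant.
a, b = 2, 3
ratios = [sandwich(a, b, n) / n for n in (1, 10, 100, 1000)]
```

Here `from_counting` and `from_formula` agree up to rounding, and the entries of `ratios` approach $\ln 3 / \ln 2$ from below with an error smaller than $1/n$.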

- ↑ The $\Omega$ we are using here does not necessarily refer to the phase space volume; it is just a notation for the number of possible states.
- ↑ Since it is the measure of your ignorance, this function is exactly the information entropy we have considered.
- ↑ Note: only for this proof the symbol $\mathbb{Q}$ will be used in its usual meaning, namely the set of rational numbers.
- ↑ This is the entropy of a system with only one allowed configuration, so it is null; we could also have kept it, but it cancels out in the computations, so it is irrelevant anyway.