# Shannon's entropy

From what we have just seen we can introduce a more general definition of entropy for a generic system that can be found in ${\displaystyle \Omega }$ different discrete states[1], each with probability ${\displaystyle p_{i}}$ (and ${\displaystyle i=1,\dots ,\Omega }$). This is known as Shannon's entropy, which is defined as:

${\displaystyle S_{S}=-k_{S}\left\langle \ln p_{i}\right\rangle =-k_{S}\sum _{i=1}^{\Omega }p_{i}\ln p_{i}}$
The constant ${\displaystyle k_{S}}$ is used instead of Boltzmann's constant because in general the connection to temperature can be irrelevant; it is defined as ${\displaystyle k_{S}=1/\ln 2}$, so that Shannon's entropy can also be rewritten as:
${\displaystyle S_{S}=-\sum _{i=1}^{\Omega }p_{i}\log _{2}p_{i}}$
The unit of measure of this entropy is the bit. Shannon's entropy is particularly useful because it is the only one that satisfies three important properties, which we now show.
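As a quick numerical illustration of this definition (a minimal sketch; the distributions below are arbitrary examples, not taken from the text), the following Python snippet computes ${\displaystyle S_{S}}$ in bits for a few simple cases:

```python
import math

def shannon_entropy_bits(p):
    """S_S = -sum_i p_i log2(p_i), in bits; zero-probability states contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(shannon_entropy_bits([0.5, 0.5]))    # fair coin: 1.0 bit
print(shannon_entropy_bits([0.9, 0.1]))    # biased coin: ~0.47 bits
print(shannon_entropy_bits([0.25] * 4))    # uniform over 4 states: 2.0 bits
```

Note that the uniform distribution gives the largest value among distributions with the same number of states, anticipating one of the properties discussed below.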

Theorem

Let ${\displaystyle p_{1},\dots ,p_{\Omega }}$ be a set of probabilities (therefore such that ${\displaystyle p_{i}\geq 0}$ and ${\displaystyle \sum _{i}p_{i}=1}$). Then Shannon's entropy ${\displaystyle S}$ (which is a function of these variables) as defined above satisfies the following properties:

• (Property zero) It is a continuous function

• (Property 1) ${\displaystyle S(p_{1},\dots ,p_{\Omega })}$ is maximized when all the ${\displaystyle p_{i}}$-s are uniform:

${\displaystyle S(p_{1},\dots ,p_{\Omega })\leq S\left({\frac {1}{\Omega }},\dots ,{\frac {1}{\Omega }}\right)}$

• (Property 2) ${\displaystyle S}$ is not affected by extra states of zero probability:

${\displaystyle S(p_{1},\dots ,p_{\Omega },0)=S(p_{1},\dots ,p_{\Omega })}$
(where on the left side ${\displaystyle S}$ has ${\displaystyle \Omega +1}$ arguments)

• (Property 3) ${\displaystyle S}$ satisfies a composition rule for conditional probabilities.

Let ${\displaystyle A=\lbrace A_{i}\rbrace }$ and ${\displaystyle B=\lbrace B_{k}\rbrace }$ be two sets of events (with ${\displaystyle i=1,\dots ,\Omega }$ and ${\displaystyle k=1,\dots ,M}$), each with probability ${\displaystyle p_{i}=p(A_{i})}$ and ${\displaystyle q_{k}=p(B_{k})}$. If we define:

${\displaystyle r_{ik}=p(A_{i}B_{k})\quad \qquad c_{ik}=p(A_{i}|B_{k})={\frac {p(A_{i}B_{k})}{p(B_{k})}}={\frac {r_{ik}}{q_{k}}}}$
which, by definition, satisfy:
${\displaystyle \sum _{i=1}^{\Omega }c_{ik}=1\quad \qquad \sum _{i=1}^{\Omega }\sum _{k=1}^{M}r_{ik}=1}$
then we can introduce:
${\displaystyle S(AB)=-k_{S}\sum _{i,k}r_{ik}\ln r_{ik}\quad \qquad S(B)=-k_{S}\sum _{k=1}^{M}q_{k}\ln q_{k}}$
${\displaystyle S(A|B_{\ell })=-k_{S}\sum _{i=1}^{\Omega }c_{i\ell }\ln c_{i\ell }}$
which are, respectively, the total entropy associated to the occurrence of events ${\displaystyle A}$ and ${\displaystyle B}$, the entropy associated to the occurrence of events ${\displaystyle B}$ and the entropy associated to the occurrence of events ${\displaystyle A}$ given ${\displaystyle B_{\ell }}$. Therefore, defining:
${\displaystyle \left\langle S(A|B_{\ell })\right\rangle =\sum _{\ell }q_{\ell }S(A|B_{\ell })=-k_{S}\sum _{i,\ell }q_{\ell }c_{i\ell }\ln c_{i\ell }}$
we have that ${\displaystyle S}$ satisfies:
${\displaystyle \left\langle S(A|B_{\ell })\right\rangle =S(AB)-S(B)}$

Proof

• (Property zero) It is immediate from its definition.

• (Property 1) First of all, let us note that the function ${\displaystyle f(x):=-x\ln x}$ is concave, since ${\displaystyle f''(x)=-1/x<0}$ for ${\displaystyle x>0}$ (and in our case ${\displaystyle x}$ represents a probability, so ${\displaystyle x\geq 0}$).

In general, from the definition of concave function we have:

${\displaystyle f(\lambda a+(1-\lambda )b)\geq \lambda f(a)+(1-\lambda )f(b)\qquad \qquad 0\leq \lambda \leq 1}$
From this we can prove by induction that:
${\displaystyle f\left({\frac {1}{\Omega }}\sum _{i=1}^{\Omega }p_{i}\right)\geq \sum _{i=1}^{\Omega }{\frac {1}{\Omega }}f(p_{i})}$
This is surely true for ${\displaystyle \Omega =2}$:
${\displaystyle f\left({\frac {p_{1}+p_{2}}{2}}\right)\geq {\frac {1}{2}}\left(f(p_{1})+f(p_{2})\right)}$
since it follows directly from the definition of concavity with ${\displaystyle p_{1}=a}$, ${\displaystyle p_{2}=b}$ and ${\displaystyle \lambda =1/2}$. If we now suppose that the inequality holds for ${\displaystyle \Omega -1}$ probabilities:
${\displaystyle f\left({\frac {1}{\Omega -1}}\sum _{i=1}^{\Omega -1}p_{i}\right)\geq {\frac {1}{\Omega -1}}\sum _{i=1}^{\Omega -1}f(p_{i})}$
it follows that it holds also for ${\displaystyle \Omega }$. In fact:
${\displaystyle f\left({\frac {1}{\Omega }}\sum _{i=1}^{\Omega }p_{i}\right)=f\left({\frac {\Omega -1}{\Omega }}{\frac {1}{\Omega -1}}\sum _{i=1}^{\Omega -1}p_{i}+{\frac {p_{\Omega }}{\Omega }}\right)}$
and choosing ${\displaystyle (\Omega -1)/\Omega =\lambda }$:
{\displaystyle {\begin{aligned}f\left({\frac {1}{\Omega }}\sum _{i=1}^{\Omega }p_{i}\right)\geq {\frac {\Omega -1}{\Omega }}f\left({\frac {1}{\Omega -1}}\sum _{i=1}^{\Omega -1}p_{i}\right)+{\frac {1}{\Omega }}f(p_{\Omega })\geq \\\geq {\frac {\Omega -1}{\Omega }}{\frac {1}{\Omega -1}}\sum _{i=1}^{\Omega -1}f(p_{i})+{\frac {1}{\Omega }}f(p_{\Omega })\end{aligned}}}
Therefore:
${\displaystyle f\left({\frac {1}{\Omega }}\sum _{i=1}^{\Omega }p_{i}\right)\geq {\frac {1}{\Omega }}\sum _{i=1}^{\Omega }f(p_{i})}$
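As a quick numerical check of this inequality (a small sketch; the vectors below are arbitrary examples), one can verify it directly for ${\displaystyle f(x)=-x\ln x}$:

```python
import math

def f(x):
    """f(x) = -x ln x, continuously extended with f(0) = 0."""
    return 0.0 if x == 0 else -x * math.log(x)

# Check f((1/Omega) * sum(p)) >= (1/Omega) * sum(f(p_i)) for a few arbitrary vectors p
for p in ([0.1, 0.2, 0.3, 0.4], [0.05, 0.05, 0.9], [0.0, 0.5, 0.5]):
    omega = len(p)
    mean = sum(p) / omega
    print(f(mean) >= sum(f(x) for x in p) / omega)   # True in every case
```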

Now, considering the case ${\displaystyle f(x)=-x\ln x}$ we will have that:

{\displaystyle {\begin{aligned}S(p_{1},\dots ,p_{\Omega })=k_{S}\sum _{i=1}^{\Omega }f(p_{i})=k_{S}\Omega {\frac {1}{\Omega }}\sum _{i=1}^{\Omega }f(p_{i})\leq k_{S}\Omega f\left({\frac {1}{\Omega }}\sum _{i=1}^{\Omega }p_{i}\right)=\\=k_{S}\Omega f\left({\frac {1}{\Omega }}\right)=k_{S}\ln \Omega =S\left({\frac {1}{\Omega }},\dots ,{\frac {1}{\Omega }}\right)\end{aligned}}}
where the inequality is strict whenever the ${\displaystyle p_{i}}$ are not all equal, since in this case ${\displaystyle f''(x)<0}$, i.e. ${\displaystyle f}$ is strictly concave.

• (Property 2) This follows immediately from the fact that ${\displaystyle f(x)}$ can be continuously extended to ${\displaystyle x=0}$ by setting ${\displaystyle f(0)=0}$ (as we have previously stated). Therefore, if we add a state with ${\displaystyle p_{\Omega +1}=0}$, it will not contribute to ${\displaystyle S}$ since by this convention ${\displaystyle p_{\Omega +1}\ln p_{\Omega +1}=0}$.

• (Property 3) Since ${\displaystyle r_{ik}=q_{k}c_{ik}}$, we have:

{\displaystyle {\begin{aligned}S(AB)=-k_{S}\sum _{i,k}q_{k}c_{ik}\ln(q_{k}c_{ik})=-k_{S}\sum _{i,k}q_{k}c_{ik}(\ln q_{k}+\ln c_{ik})=\\=-k_{S}\left(\sum _{i,k}q_{k}c_{ik}\ln q_{k}+\sum _{i,k}q_{k}c_{ik}\ln c_{ik}\right)\end{aligned}}}
However:
${\displaystyle \sum _{i,k}q_{k}c_{ik}\ln q_{k}=\sum _{k}q_{k}\ln q_{k}\sum _{i}c_{ik}=\sum _{k}q_{k}\ln q_{k}}$
and therefore:
${\displaystyle S(AB)=-k_{S}\sum _{k}q_{k}\ln q_{k}-k_{S}\sum _{i,k}q_{k}c_{ik}\ln c_{ik}=S(B)+\left\langle S(A|B_{\ell })\right\rangle }$
from which we have immediately:
${\displaystyle \left\langle S(A|B_{\ell })\right\rangle =S(AB)-S(B)}$

Let us note an interesting consequence of the last property of the entropy: if we suppose ${\displaystyle A_{i}}$ and ${\displaystyle B_{k}}$ to be independent then ${\displaystyle S(A|B_{\ell })=S(A)}$ and we get ${\displaystyle S(AB)=S(A)+S(B)}$: we have thus found that entropy is extensive! Therefore, the last property of ${\displaystyle S}$ is a generalization of the extensivity of the entropy for correlated events.
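To make these properties concrete, here is a small numerical sanity check (a minimal sketch in Python; the specific distributions are arbitrary choices, not taken from the text): it verifies that the uniform distribution maximizes ${\displaystyle S}$ (Property 1), that a zero-probability state does not change ${\displaystyle S}$ (Property 2), and that ${\displaystyle \left\langle S(A|B_{\ell })\right\rangle =S(AB)-S(B)}$, which reduces to additivity for independent events (Property 3):

```python
import math

def S(p):
    """Shannon entropy in nats (k_S = 1); states with p_i = 0 contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Property 1: the uniform distribution maximizes S (here Omega = 4, arbitrary example)
uniform = [0.25] * 4
print(S([0.7, 0.1, 0.1, 0.1]) <= S(uniform), S([0.4, 0.3, 0.2, 0.1]) <= S(uniform))

# Property 2: an extra state with zero probability does not change S
print(math.isclose(S([0.6, 0.4, 0.0]), S([0.6, 0.4])))

# Property 3: <S(A|B_l)> = S(AB) - S(B), for an arbitrary 2x2 joint distribution r_ik
r = [[0.10, 0.30],
     [0.25, 0.35]]                               # r[i][k] = p(A_i B_k)
q = [r[0][k] + r[1][k] for k in range(2)]        # marginal of B
S_AB = S([r[i][k] for i in range(2) for k in range(2)])
avg_S_A_given_B = sum(q[k] * S([r[i][k] / q[k] for i in range(2)]) for k in range(2))
print(math.isclose(avg_S_A_given_B, S_AB - S(q)))

# Consequence: for independent events, r_ik = p_i q_k and S(AB) = S(A) + S(B)
p, qB = [0.6, 0.4], [0.3, 0.7]
r_ind = [pi * qk for pi in p for qk in qB]
print(math.isclose(S(r_ind), S(p) + S(qB)))
```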

Now, since these properties are a little bit obscure as we have stated them, let us make a simple example to illustrate their meaning. To make things more comprehensible, we use the same notation as in the theorem. Suppose you have lost your house keys and want to find them; you can measure your progress in finding them by measuring your ignorance with some function[2] ${\displaystyle S}$. Suppose there are ${\displaystyle \Omega }$ possible places ${\displaystyle A_{i}}$ where you might have left the keys, each one with an estimated probability ${\displaystyle p_{i}}$; the three properties of the entropy ${\displaystyle S}$ can thus be interpreted as follows:

• without further information, your ignorance is maximal and the keys could be in any of the possible places, each with the same probability
• if there is no possibility that the keys might be in a particular place ${\displaystyle A_{\Omega }}$ (your shoe, for example), then your ignorance is the same as it would have been if you had not included that place in the list of possible sites
• to improve the search, you are likely to look where you last saw the keys; let us call these ${\displaystyle M}$ places ${\displaystyle B_{k}}$, each with probability ${\displaystyle q_{k}}$ of being the place where you last saw them.

Before thinking about where you last saw the keys, your ignorance about their location is ${\displaystyle S(A)=S(p_{1},\dots ,p_{\Omega })}$ and the one about where you last saw them is ${\displaystyle S(B)=S(q_{1},\dots ,q_{M})}$; therefore you also have the joint ignorance ${\displaystyle S(AB)}$. If the place where you last saw them is ${\displaystyle B_{\ell }}$, then your ignorance about the location of the keys is ${\displaystyle S(A|B_{\ell })}$, and so your combined ignorance has been reduced from ${\displaystyle S(AB)}$ to ${\displaystyle S(A|B_{\ell })}$. You can now measure the usefulness of your guess by determining how much it will reduce your ignorance about where the keys are: the expected ignorance after guessing where you last saw them is given by weighting the ignorance after each guess ${\displaystyle B_{\ell }}$ with the probability of that guess, namely it is ${\displaystyle \left\langle S(A|B_{\ell })\right\rangle }$. The last property of the entropy thus states that after such a guess your expected ignorance decreases exactly by the amount ${\displaystyle S(B)}$.

Now, what makes Shannon's entropy as defined in Entropy as ignorance: information entropy so special is the fact that it is unique (of course up to a proportionality constant), and this follows solely from the properties that we have just shown it satisfies.

Theorem

Shannon's entropy, defined as:

${\displaystyle S_{S}=-k_{S}\sum _{i}p_{i}\ln p_{i}}$
and satisfying the properties shown in the previous theorem, is unique up to a normalization constant. In other words, those properties uniquely identify Shannon's entropy.

The idea of the proof is the following: we first show that those properties force ${\displaystyle S_{S}=S({\vec {p}})}$ to have indeed the expression of Shannon's entropy when[3] ${\displaystyle p_{i}\in \mathbb {Q} ^{+}}$. Then, since ${\displaystyle S}$ is a continuous function (property 0 of the previous theorem), ${\displaystyle S({\vec {p}})}$ has the same expression also for ${\displaystyle p_{i}\in \mathbb {R} ^{+}}$.

Proof

Let us take ${\displaystyle {\vec {q}}\in \mathbb {Q} ^{M}}$, and write its components as ${\displaystyle q_{\ell }=g_{\ell }/g}$ with ${\displaystyle g,g_{\ell }\in \mathbb {N} }$ (where ${\displaystyle g}$ is a common denominator of the ${\displaystyle q_{\ell }}$-s, for example the least common multiple of their denominators; note that ${\displaystyle \sum _{\ell }g_{\ell }=g}$). We suppose ${\displaystyle r_{k\ell }=1/g}$, so that ${\displaystyle c_{k\ell }=r_{k\ell }/q_{\ell }=1/g_{\ell }}$ (with ${\displaystyle k=1,\dots ,g_{\ell }}$, so that there are ${\displaystyle g}$ joint events in total). Defining ${\displaystyle L(g)=S_{S}(1/g,\dots ,1/g)}$, we have:

${\displaystyle S_{S}(AB)=S_{S}(\{r_{k\ell }\})=S_{S}\left(\underbrace {{\frac {1}{g}},\dots ,{\frac {1}{g}}} _{g{\text{ arg.}}}\right)=L(g)}$
${\displaystyle S_{S}(A|B)=\sum _{\ell =1}^{M}q_{\ell }S_{S}(c_{1\ell },\dots ,c_{g_{\ell }\ell })=\sum _{\ell }q_{\ell }L(g_{\ell })}$
From the third property of ${\displaystyle S_{S}}$ we have:
{\displaystyle {\begin{aligned}S_{S}(AB)=S_{S}(A|B)+S_{S}(B)\quad \Rightarrow \quad L(g)=\sum _{\ell }q_{\ell }L(g_{\ell })+S_{S}(q_{1},\dots ,q_{M})\quad \Rightarrow \\\Rightarrow \quad S_{S}(q_{1},\dots ,q_{M})=L(g)-\sum _{\ell }q_{\ell }L(g_{\ell })\end{aligned}}}
Therefore, if we know ${\displaystyle L(g)}$ we can express ${\displaystyle S_{S}}$ (when its arguments are rational, for now). Let us see that from this relation we get the expression of Shannon's entropy if ${\displaystyle L(g)=k\ln g}$, with ${\displaystyle k}$ a generic constant:
{\displaystyle {\begin{aligned}S_{S}(q_{1},\dots ,q_{M})=L(g)-\sum _{\ell }q_{\ell }L(g_{\ell })=k\ln g-k\sum _{\ell }q_{\ell }\ln g_{\ell }=\\=k\sum _{\ell }q_{\ell }(\ln g-\ln g_{\ell })=-k\sum _{\ell }q_{\ell }\ln {\frac {g_{\ell }}{g}}=-k\sum _{\ell }q_{\ell }\ln q_{\ell }\end{aligned}}}
which is exactly Shannon's entropy. We therefore must prove that ${\displaystyle L(g)=k\ln g}$, and in order to do that we will use the first and second properties of ${\displaystyle S_{S}}$. Let us call ${\displaystyle a}$ and ${\displaystyle b}$ two integers such that ${\displaystyle a<b}$; then from the second property we have:
${\displaystyle S_{S}\left(\underbrace {{\frac {1}{a}},\dots ,{\frac {1}{a}}} _{a{\text{ arg.}}}\right)=S_{S}\left(\underbrace {{\frac {1}{a}},\dots ,{\frac {1}{a}}} _{a{\text{ arg.}}},\underbrace {0,\dots ,0} _{b-a{\text{ arg.}}}\right)}$
and from the first:
${\displaystyle S_{S}\left(\underbrace {{\frac {1}{a}},\dots ,{\frac {1}{a}}} _{a{\text{ arg.}}},\underbrace {0,\dots ,0} _{b-a{\text{ arg.}}}\right)\leq S_{S}\left(\underbrace {{\frac {1}{b}},\dots ,{\frac {1}{b}}} _{b{\text{ arg.}}}\right)=L(b)}$
and since for ${\displaystyle a<b}$ the distribution on the left-hand side is not uniform, the inequality is actually strict (as noted in the proof of the first property); so if ${\displaystyle a<b}$ then ${\displaystyle L(a)<L(b)}$. Let now ${\displaystyle C^{(1)},C^{(2)},\dots ,C^{(n)}}$ be ${\displaystyle n}$ classes, each containing ${\displaystyle g}$ independent events with uniform probability; if we call ${\displaystyle A}$ the set of the ${\displaystyle g}$ events in ${\displaystyle C^{(1)}}$ and ${\displaystyle B}$ the set of the ${\displaystyle g^{n-1}}$ joint events of the remaining classes, from the third property of ${\displaystyle S_{S}}$ we have:
${\displaystyle S_{S}(AB)=S_{S}(A)+S_{S}(B)\quad \Rightarrow \quad L(g^{n})=L(g)+L(g^{n-1})}$
we thus have found a recursive formula for ${\displaystyle L(g^{n})}$; applying it ${\displaystyle n}$ times we get:
${\displaystyle L(g^{n})=\underbrace {L(g)+L(g)+\cdots +L(g)} _{n{\text{ times}}}+L(1)=nL(g)+L(1)}$
and we set[4] ${\displaystyle L(1)=0}$. Therefore, ${\displaystyle L(g^{n})=nL(g)}$. Let us now take ${\displaystyle s\in \mathbb {N} }$ and an arbitrary ${\displaystyle n\in \mathbb {N} }$; there will surely be an ${\displaystyle m\in \mathbb {N} }$ for which:
${\displaystyle 2^{m}\leq s^{n}<2^{m+1}}$
so (this can be done because ${\displaystyle L(a)<L(b)}$ if ${\displaystyle a<b}$):
{\displaystyle {\begin{aligned}L(2^{m})\leq L(s^{n})<L(2^{m+1})\quad \Rightarrow \quad mL(2)\leq nL(s)<(m+1)L(2)\quad \Rightarrow \\\Rightarrow \quad {\frac {m}{n}}\leq {\frac {L(s)}{L(2)}}<{\frac {m+1}{n}}\end{aligned}}}
Now, the logarithm ${\displaystyle \ln }$ is a monotonically increasing function, so if ${\displaystyle a<b}$ then ${\displaystyle \ln a<\ln b}$; therefore, applying it to ${\displaystyle 2^{m}\leq s^{n}<2^{m+1}}$ and dividing by ${\displaystyle n\ln 2}$, we find that a similar inequality holds also for the logarithm:
${\displaystyle {\frac {m}{n}}\leq {\frac {\ln s}{\ln 2}}<{\frac {m+1}{n}}}$
This means that ${\displaystyle L(s)/L(2)}$ and ${\displaystyle \ln s/\ln 2}$ both belong to the interval ${\displaystyle [m/n,(m+1)/n]}$, whose width is ${\displaystyle 1/n}$. Thus:
${\displaystyle \left|{\frac {\ln s}{\ln 2}}-{\frac {L(s)}{L(2)}}\right|<{\frac {1}{n}}}$
and taking the limit ${\displaystyle n\to \infty }$ (in fact ${\displaystyle n}$ is arbitrary, so the inequality must hold for every ${\displaystyle n}$) we obtain:
${\displaystyle L(s)={\frac {L(2)}{\ln 2}}\ln s}$
and renaming ${\displaystyle L(2)/\ln 2}$ as ${\displaystyle k}$, we get:
${\displaystyle L(g)=k\ln g}$
which is exactly what we wanted to prove.
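As a closing numerical sketch of two steps of this proof (the distribution ${\displaystyle q=(1/2,1/3,1/6)}$ and the value ${\displaystyle s=3}$ are arbitrary choices): first, with ${\displaystyle L(g)=\ln g}$ the combination ${\displaystyle L(g)-\sum _{\ell }q_{\ell }L(g_{\ell })}$ reproduces ${\displaystyle -\sum _{\ell }q_{\ell }\ln q_{\ell }}$; second, the bounds ${\displaystyle m/n}$ and ${\displaystyle (m+1)/n}$ squeeze ${\displaystyle \log _{2}s}$ as ${\displaystyle n}$ grows, which is the mechanism forcing ${\displaystyle L(s)=(L(2)/\ln 2)\ln s}$:

```python
import math

# Step 1: rational distribution q_l = g_l / g (here g = 6, g_l = 3, 2, 1, i.e.
# q = (1/2, 1/3, 1/6)); check L(g) - sum_l q_l L(g_l) = -sum_l q_l ln(q_l) with L = ln.
g, g_l = 6, [3, 2, 1]
q = [gl / g for gl in g_l]
L = math.log
lhs = L(g) - sum(ql * L(gl) for ql, gl in zip(q, g_l))
rhs = -sum(ql * math.log(ql) for ql in q)
print(lhs, rhs)                                   # both ~1.0114 (nats)

# Step 2: the sandwich 2**m <= s**n < 2**(m+1) squeezes log2(s) between m/n and (m+1)/n.
s = 3
for n in (1, 5, 20, 100, 1000):
    m = (s ** n).bit_length() - 1                 # largest m with 2**m <= s**n
    print(n, m / n, (m + 1) / n, math.log2(s))    # m/n <= log2(s) < (m+1)/n
```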

Therefore, we see that Shannon's entropy is uniquely determined by the properties shown in the first theorem.

1. The ${\displaystyle \Omega }$ we are using here does not necessarily refer to the phase space volume; it is just a notation for the number of possible states.
2. Since it is the measure of your ignorance, this function is exactly the information entropy we have considered.
3. Note: only for this proof the symbol ${\displaystyle \mathbb {Q} }$ will be used in its usual meaning, namely the set of rational numbers.
4. This is the entropy of a system with only one allowed configuration, so it is null; we could also have kept it, but it cancels out in the computations and is therefore irrelevant.