### There is no mutual information without entropy

The concept of *entropy* is itself confusing, which means that, so far, it has high entropy.
Once I associate it with the concept of mutual information, its entropy
decreases. Alright, I got my chance to confuse the reader, and that was
actually fun.

*Entropy* is one of the most confusing concepts that has been
borrowed by computer scientists from physicists studying thermodynamics.
Just ignore for the next two lines what *entropy* should be. Think about it as
something that measures something else.

Now, imagine a bunch of molecules in a glass at time $t$, temperature $T$ and pressure $P$. Call the entropy of that system $S$.
As the temperature decreases, the molecules slow down
and tend to settle into fixed positions. As time goes by, the entropy of the
system decreases and the information, understood as the *“certainty”* about the exact
position of each molecule, increases. This can be pushed to an idealised extreme case
in which the temperature is so low (absolute zero) that all the molecules
stay in positions that we can measure exactly. In that case we are not only
100% sure that the measured position is the real one, but also that the system
cannot move into a different configuration. The entropy of such a system is at
its minimum. No uncertainty. No alternative configurations.

Since we’re not doing physics here, let’s go back to planet Earth and do some information theory.
Here, *entropy* measures the **amount of uncertainty** of a system, and with it the amount of *information* that is present in a random signal.
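A signal $x$ that a source emits with probability $p$ contributes

$$-p \log p$$

to the entropy at that source. If the message can be represented by an alphabet of $n$ symbols $x_1, \dots, x_n$,
and the source emits them with probabilities $p_1, \dots, p_n$, the entropy at
the source is

$$H = -\sum_{i=1}^{n} p_i \log p_i.$$

Usually, the term $-\log p_i$ is written as $I(p_i)$ and called
**information**.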

A quite simple explanation of this is that a very frequent symbol (for which $p_i$ would be high) carries little information; a rare symbol, on the other hand, carries a lot of information about the overall message. It makes perfect sense to me. Or does it?
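To make the frequent-versus-rare intuition concrete, here is a minimal Python sketch of the two quantities above. The four-symbol alphabet and its probabilities are made up for illustration, and the base-2 logarithm (so that everything is measured in bits) is an assumption, since the formulas above do not fix a base.

```python
import math

def self_information(p):
    """Information -log2(p), in bits, carried by a symbol of probability p."""
    return -math.log2(p)

def entropy(probs):
    """Shannon entropy H = -sum(p_i * log2(p_i)), in bits.

    Symbols with zero probability are skipped: they contribute nothing.
    """
    return sum(p * self_information(p) for p in probs if p > 0)

# Hypothetical alphabet, chosen only for illustration
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

info = {symbol: self_information(p) for symbol, p in probs.items()}
H = entropy(probs.values())

for symbol, bits in info.items():
    print(f"{symbol}: p = {probs[symbol]}, information = {bits} bits")
print(f"entropy = {H} bits")
```

The frequent symbol `a` carries 1 bit, while the rare symbols `c` and `d` carry 3 bits each; the entropy, 1.75 bits, is the average of those values weighted by how often each symbol shows up.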

With this said, let’s jump to the mutual information $I(X;Y)$ between two variables $X$ and $Y$.

This quantity measures the mutual dependence between $X$ and $Y$.

It is given by

$$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)},$$

which basically translates into “how much does knowing $X$ reduce the uncertainty about $Y$?”
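As a rough sketch of how mutual information can be computed for two discrete variables, the snippet below takes a joint probability table and evaluates the standard sum $\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$. The three joint distributions are hypothetical, and base-2 logarithms (bits) are again an assumption.

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x) * p(y))), in bits.

    `joint` maps (x, y) pairs to probabilities that sum to 1.
    """
    # Marginals p(x) and p(y), obtained by summing the joint table
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0  # zero-probability cells contribute nothing
    )

# Hypothetical joint distributions over two binary variables
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
copies = {(0, 0): 0.5, (1, 1): 0.5}  # Y is always equal to X

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
mi_copies = mutual_information(copies)

print(mi_independent, mi_dependent, mi_copies)
```

In the independent table every ratio inside the logarithm is 1, so the mutual information is 0. When one variable is a copy of the other, knowing one removes the full bit of uncertainty about the other. The partially dependent table lands in between, at roughly 0.28 bits.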

In fact, if $X$ and $Y$ are independent, then $p(x,y) = p(x)\,p(y)$ and $I(X;Y) = 0$. This too makes perfect sense to me. There are some properties that make the link between mutual information and entropy even stronger. I will list a few:

1. $I(X;X) = H(X)$, which means that the mutual information between $X$ and itself is its entropy. Once $X$ is known, the uncertainty that has been removed about $X$ is exactly its entropy.

2. $I(X;Y) = H(X) - H(X|Y)$, where $H(X|Y)$ is the amount of uncertainty about $X$ that remains after $Y$ is known.

3. More generally, $I(X;Y) \le H(X)$, which means that a variable contains at least as much information about itself as the one provided by any other variable.

4. Finally, $H(X|Y) \le H(X)$, which means that uncertainty does not increase as other variables become known (namely, as the system goes towards a fixed, certain state).
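The standard identities that tie mutual information to entropy, namely $I(X;X) = H(X)$, $I(X;Y) = H(X) - H(X|Y)$, $I(X;Y) \le H(X)$ and $H(X|Y) \le H(X)$, can be checked numerically. The sketch below does so on one hypothetical joint table; it is a sanity check under that assumption, not a proof, and it uses base-2 logarithms throughout.

```python
import math

def marginals(joint):
    """Return p(x) and p(y) as dicts, summing a joint table keyed by (x, y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def entropy(dist):
    """H = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def conditional_entropy(joint):
    """H(X|Y) = -sum over (x, y) of p(x,y) * log2(p(x|y)), with p(x|y) = p(x,y) / p(y)."""
    _, py = marginals(joint)
    return -sum(p * math.log2(p / py[y]) for (x, y), p in joint.items() if p > 0)

def mutual_information(joint):
    """I(X;Y) = sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x) * p(y)))."""
    px, py = marginals(joint)
    return sum(
        p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0
    )

# Hypothetical joint distribution of two binary variables
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
px, py = marginals(joint)

h_x = entropy(px)
h_x_given_y = conditional_entropy(joint)
i_xy = mutual_information(joint)

# Joint distribution of X with itself: all the mass sits on the diagonal
self_joint = {(x, x): p for x, p in px.items()}
i_xx = mutual_information(self_joint)

tol = 1e-9
print("I(X;X) = H(X):           ", abs(i_xx - h_x) < tol)
print("I(X;Y) = H(X) - H(X|Y):  ", abs(i_xy - (h_x - h_x_given_y)) < tol)
print("I(X;Y) <= H(X):          ", i_xy <= h_x + tol)
print("H(X|Y) <= H(X):          ", h_x_given_y <= h_x + tol)
```

All four lines should print `True` for any valid joint table you plug in, which is a handy way to catch a bug in your own implementation.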

Let’s revisit these concepts in statistics now. One of the most illuminating interpretations of mutual information is the one that recalls the Kullback-Leibler divergence between distributions.

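It represents mutual information as

$$I(X;Y) = D_{KL}\big(p(x,y) \,\|\, p(x)\,p(y)\big),$$

which I find elegant and amazing at the same time. Let me just add this reformulation:

$$I(X;Y) = \sum_{y} p(y)\, D_{KL}\big(p(x|y) \,\|\, p(x)\big),$$

which means that the more $p(x|y)$ differs from $p(x)$, the higher the amount of “information gain”.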
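Here is a small sketch of the Kullback-Leibler view, assuming the standard definition $D_{KL}(p \,\|\, q) = \sum_k p_k \log \frac{p_k}{q_k}$ with base-2 logarithms. It computes mutual information in two ways on a hypothetical joint table: as the divergence between the joint distribution and the product of its marginals, and as the average over $y$ of the divergence between $p(x|y)$ and $p(x)$.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum over k of p[k] * log2(p[k] / q[k]), in bits.

    Assumes q[k] > 0 wherever p[k] > 0.
    """
    return sum(pk * math.log2(pk / q[k]) for k, pk in p.items() if pk > 0)

# Hypothetical joint distribution of two binary variables
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) and p(y)
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

# What the joint would look like if X and Y were independent
product = {(x, y): px[x] * py[y] for x in px for y in py}

# First form: divergence between the joint and the product of marginals
mi_as_kl = kl_divergence(joint, product)

# Second form: average over y of the divergence between p(x|y) and p(x)
mi_as_expected_kl = sum(
    py[y] * kl_divergence({x: joint.get((x, y), 0.0) / py[y] for x in px}, px)
    for y in py
)

print(mi_as_kl, mi_as_expected_kl)
```

The two numbers agree up to floating-point error. The divergence of a distribution from itself is zero, which is another way of seeing why independent variables share no information.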

Cool, huh?
