Ramblings around information theory

Author

Oleguer Canal

Published

April 15, 2024

Today we’ll look into the concept of information from a probabilistic perspective 1. Hold on to your hat 👒 because we will connect topics as random as:

1 If you’re a bit rusty on probability stuff, check the basics of probability refresher and common probability distributions.

  • your friend’s life-changing trip to Thailand 🙄

  • picking up messages from aliens 📞

  • how to see a picture of your own death 👻

  • the origin and convergence of the universe 🪐

Anyway, to start with a bit of context (pun intended): this paper by Claude Shannon laid the theoretical groundwork for understanding information transmission, data compression, cryptography, and core concepts in machine learning. Not bad…


The plan is to introduce basic concepts of information theory (such as Shannon Information Content, Entropy, Cross-Entropy…), and from those I will go on tangents about whatever comes to mind 2.

2 This format is new to me, so let’s see what turns out. Usually I talk about more specific stuff.

Shannon information

Definition: Information gained by observing an event X under a discrete 3 distribution P:

3 Why not continuous? Technically the probability of a continuous random variable taking a particular value is 0, so this notion of Shannon information is not defined. There exist extensions to continuous variables, but I won’t open that can of worms for now.

$$I_P(X) := -\log(P(X))$$

Encodes “how surprising” an observed event is. 4 The base unit is the bit (aka binary digit): 1 bit corresponds to the amount of information gained by observing the outcome of a Bernoulli(λ=0.5) (i.e. the flip of a perfect coin).

4 Notice:
If p(X)≃1: X is “0-surprising”
If p(X)≃0: X is “∞-surprising”

It is the only function (up to a multiplicative scalar) that meets Shannon’s axioms:

  1. An event with probability 100% is “perfectly unsurprising”

  2. Monotonically decreasing: The less probable an event is → the more surprising it is → the more information it yields

  3. Independence: If two independent events are measured separately, the amount of information gained is the sum of the self-informations of the individual events.

    $$I(X \cap Y) = -\log(P(X \cap Y)) = -\log(P(X) \cdot P(Y)) = -\log(P(X)) - \log(P(Y)) = I(X) + I(Y)$$
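
A quick numeric sanity check of this additivity (the probabilities here are arbitrary, just for illustration): for two independent events with P(X) = 1/2 and P(Y) = 1/4, observing both should give 1 + 2 = 3 bits.

Code
import numpy as np

# Arbitrary probabilities of two independent events (illustrative values)
p_x, p_y = 0.5, 0.25

def info(p):
    return -np.log2(p)  # Shannon information in bits

print(info(p_x * p_y))        # joint event: 3.0 bits
print(info(p_x) + info(p_y))  # sum of individual informations: 1.0 + 2.0 = 3.0 bits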

Code
import numpy as np
import matplotlib.pyplot as plt

prob = np.arange(1e-10, 1., 1e-3)
info = -np.log(prob)  # natural log gives nats; use np.log2 instead to measure in bits
fig, ax = plt.subplots()
ax.plot(prob, info)
ax.set_xlabel("Probability of the event")
ax.set_ylabel("Information gained from obesring the event")
ax.grid(True)
plt.show()
Figure 1

What about the bits in my computer?

Same thing: You can think of reading a stream of bits (e.g. from your disk) as sampling from a Bernoulli(λ=0.5) distribution. Outcomes can either be 0 or 1 with a 50% probability (if we have no prior knowledge). Thus, each yields an information content of 1 bit:

$$I_{\text{Bern}(\lambda=0.5)}(X) = -\log_2\left(\tfrac{1}{2}\right) = 1 \text{ bit}$$

Due to historical and practical reasons (in hardware and software design), a more standard unit of data in computing became 1 byte = 8 bits. With a byte of information you can represent 2^8 = 256 values. Common examples are characters in text (check ASCII) or light intensity in a pixel of a gray-scale image.
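
As a tiny illustration (assuming, as above, a uniform prior over all byte values), reading one random byte yields 8 bits of information:

Code
import numpy as np

n_bits = 8
n_values = 2 ** n_bits        # a byte can take 256 distinct values
p_byte = 1 / n_values         # uniform prior over all byte values
print(n_values)               # 256
print(-np.log2(p_byte))       # 8.0 bits of information per byte read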

We will get back to all this when we talk about data compression in telecommunications.

What about quantum information?

The basic unit of information is called qubit (aka quantum bit): A qubit is a two-state quantum-mechanical system, which can be encoded by multiple particles, for instance: photons, electrons or ions.

When measuring the state of a qubit, it collapses to one of 2 states, which can be encoded as 0 and 1. Interestingly, after observation its internal state gets disturbed, losing some of the characteristics that make it so powerful. Before measurement, however, a qubit presents several properties not captured by classical physics:

  • Superposition: qubits exist in a state of being both 0 and 1 simultaneously with different degrees (probabilities). This allows quantum computers to process a vast amount of possibilities simultaneously, yielding a huge speedup in certain problems.

  • Entanglement: qubits can become entangled, where the state of one instantly influences the state of another one, regardless of the distance between them. This property enables the creation of highly correlated states that can be used to perform complex calculations more efficiently than classical computers can.

Coin

As established, observing the toss of a fair coin gives you 1 bit of information:

$$I_{\text{Bern}(\lambda=0.5)}(X = \text{heads}) = -\log_2\left(\tfrac{1}{2}\right) = 1 \text{ bit}$$

Dice

Observing the outcome of a (fair, six-sided) die yields more information, because each possibility has a lower probability. Observing an outcome of a 4 gives you:

$$I_{\text{Cat}(\lambda_i = \frac{1}{6}\ \forall i = 1:6)}(X = 4) = -\log_2\left(\tfrac{1}{6}\right) \simeq 2.6 \text{ bits}$$
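
The same two numbers, computed directly (just to double-check the arithmetic):

Code
import numpy as np

# Fair coin: two equally likely outcomes
print(-np.log2(1 / 2))   # 1.0 bit

# Fair six-sided die: six equally likely outcomes
print(-np.log2(1 / 6))   # ~2.585 bits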

Metro

Consider a station where a metro passes every 5 minutes. If you arrive at the station at a random moment, your wait time is described by a uniform distribution U(a=0, b=5).

This means that the probability of the metro not arriving within the first T∈[0,5] minutes (i.e. of waiting at least T minutes) is:

$$P(X \geq T) = \int_{T}^{b} U(t; a, b)\, dt = \int_{T}^{b} \frac{1}{b-a}\, dt = 1 - \frac{T-a}{b-a} = \frac{5-T}{5}$$

Thus, the “amount of surprise” we get from learning that the wait time X was at least T∈[0,5] minutes is:

$$I_X(T) = -\log_2\left(P(X \geq T)\right) = -\log_2\left(\frac{5-T}{5}\right)$$

So we are 0-surprised to learn that the wait was at least 0 minutes (the metro for sure arrives within that time-frame). However, since it is quite unlucky to arrive just after a train has left, learning that the wait was close to the full 5 minutes surprises us a lot (very few wait times are that long). If we plot it:

Code
b = 5

# Range of wait-time thresholds T (in minutes) for which we plot probability and information
t = np.linspace(0, 4.99, 1000)  # from 0 to just under 5 mins (avoids log2(0) at T = 5)

wait_at_least = (b - t) / b  # P(wait >= T) under U(0, 5)
information = -np.log2(wait_at_least)

# Plot the probability and the Shannon information side by side
fig, axs = plt.subplots(1, 2, figsize=(7, 4))

# Plot probability of waiting at least T minutes
axs[0].plot(t, wait_at_least, label='P(wait ≥ T)', color='green')
axs[0].set_title('Wait at least T min. prob.')
axs[0].set_xlabel('Time (mins)')
axs[0].set_ylabel('Probability')
axs[0].grid(True)
axs[0].legend()

# Plot Shannon Information in mins
axs[1].plot(t, information, label='Shannon Information', color='red')
axs[1].set_title('Information Content')
axs[1].set_xlabel('Time (mins)')
axs[1].set_ylabel('Information (bits)')
axs[1].grid(True)
axs[1].legend()

plt.tight_layout()
plt.show()
Figure 2

Note: We converted a continuous problem into a discrete one by asking about the event “waiting at least T minutes”. We apply the same trick in the following example.

Thailand

Consider that friend who mentions their (life-changing) trip to Thailand an average of 3 times per hour of conversation. At any random moment of a conversation, the time until the next Thailand mention follows an exponential distribution: Exp(λ = 3 per hour).

Which means that the probability of the friend NOT mentioning Thailand in the next T hours (i.e. of the waiting time X exceeding T) is:

$$P(X > T) = 1 - \int_0^T \text{Exp}(t; \lambda)\, dt = 1 - \int_0^T \lambda e^{-\lambda t}\, dt = e^{-\lambda T}$$

Thus, the amount of surprise we get for them NOT mentioning Thailand for T hours is:

$$I_X(T) = -\log_2\left(P(X > T)\right) = -\log_2\left(e^{-\lambda T}\right) = \frac{\lambda T}{\log_e(2)} \text{ bits}$$

Therefore, your “amount of surprise” grows linearly with every minute your friend doesn’t say something related to Thailand. If we plot it:

Code
# Rate parameter of the exponential distribution: 3 mentions per hour
lambda_param = 3 / 1  # mentions per hour

# Range of time values (in hours) for which we plot the probability and the information
time_values_hours = np.linspace(0, 2, 1000)  # from 0 to 2 hours

not_mention = np.exp(-lambda_param * time_values_hours)  # P(no mention in the next T hours)
information_values_hours = -np.log2(not_mention)

# Plot the probability and the Shannon information side by side
fig, axs = plt.subplots(1, 2, figsize=(7, 4))

# Plot CDF in hours
axs[0].plot(time_values_hours, not_mention, label='1 - CDF: λ = 3/1 per hour', color='green')
axs[0].set_title('Not mentioning Thailand prob')
axs[0].set_xlabel('Time (hours)')
axs[0].set_ylabel('Probability')
axs[0].grid(True)
axs[0].legend()

# Plot Shannon Information in hours
axs[1].plot(time_values_hours, information_values_hours, label='Shannon Information', color='red')
axs[1].set_title('Information Content')
axs[1].set_xlabel('Time (hours)')
axs[1].set_ylabel('Information (bits)')
axs[1].grid(True)
axs[1].legend()

plt.tight_layout()
plt.show()
Figure 3

Entropy

Definition: Measures the (weighted) average information content of a probability distribution P.

$$H(P) := \mathbb{E}_P\left[I_P(X)\right]$$

It answers how surprised you expect to be by sampling from P.

  • High entropy → More uncertainty: Many events are similarly likely. More chaos.
  • Low entropy → More deterministic: Few events are very likely. More order.

If we expand the expectation:

$$H(P) := \sum_{x \in \mathcal{X}} p(x) I_P(x) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

Coin vs Biased Coin

Given a perfect coin we have an entropy of 1 bit:

$$H(\text{Bern}(\lambda=0.5)) = -\left(\tfrac{1}{2}\log_2\left(\tfrac{1}{2}\right) + \tfrac{1}{2}\log_2\left(\tfrac{1}{2}\right)\right) = 1 \text{ bit}$$

However, if we know that one side of the coin is more likely than the other… Let’s say that for some bizarre reason the probability of heads is 90% and tails 10%:

$$H(\text{Bern}(\lambda=0.9)) = -\left(\tfrac{1}{10}\log_2\left(\tfrac{1}{10}\right) + \tfrac{9}{10}\log_2\left(\tfrac{9}{10}\right)\right) \approx 0.47 \text{ bits}$$

Overall, we have lower “average surprise”. If we predict the outcomes will be heads we’ll get most of the answers right. Here the observations are “less chaotic” or “more deterministic”.
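
Both entropies, computed numerically (a minimal helper valid for any bias 0 < λ < 1; λ exactly 0 or 1 would need special-casing to avoid a log of zero):

Code
import numpy as np

def bernoulli_entropy(lam):
    """Entropy (in bits) of a Bernoulli(lambda) variable, for 0 < lambda < 1."""
    probs = np.array([lam, 1 - lam])
    return -np.sum(probs * np.log2(probs))

print(bernoulli_entropy(0.5))  # 1.0 bit    (fair coin)
print(bernoulli_entropy(0.9))  # ~0.47 bits (biased coin)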

Decision trees

[In short: at each node, a decision tree picks the split with the biggest information gain, i.e. the split that reduces the entropy of the labels the most (see the sketch below).]
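
To make that a bit more concrete, here is a minimal sketch (with made-up labels, not any particular library’s API) of the quantity a decision tree greedily maximizes: the information gain of a split, i.e. the parent node’s entropy minus the weighted entropy of its children.

Code
import numpy as np

def entropy(labels):
    """Entropy (in bits) of the empirical label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Hypothetical binary labels at a node, and one candidate split
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]          # this split separates the classes perfectly
print(information_gain(parent, left, right))  # 1.0 bit: the best possible gain here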

Most well-known probability distributions maximize entropy given some constraints:

Distribution | Conditions for Maximizing Entropy
------------ | ---------------------------------
Uniform      | For a given real interval.
Gaussian     | For a given mean and variance over the real numbers.
Exponential  | For a given mean, for positive real values.
Poisson      | For a given mean number of events in a fixed interval.
Binomial     | For a given number of trials and success probability.
Geometric    | For a given success probability, counting trials until first success.
Beta         | For variables in [0,1] with a given mean and variance.
Dirichlet    | For distributions over a simplex.

Intuitively, it makes sense to assume maximum entropy when modelling stuff: given the constraints you know about, you go for the most uncertain case possible.

Consider the images in Figure 4. While they are all represented by the same length of bits, the amount of information varies:

  • Random image: A sample from a categorical distribution with λ_i = 1/256 (for each pixel and channel). As seen, this distribution presents the maximum possible entropy in the space of h×w×3 8-bit data. 5
  • Good boy image: A sample from the dog-image distribution. Whatever this probability distribution looks like, its entropy is going to be lower: pixels in this space present a high level of correlation (thus lower information). This greatly reduces the degrees of freedom or “room for unpredictability”. From a probability perspective, some pixel values have a higher probability than others, making the average amount of surprise lower 6.
(a) Black picture
(b) Good boy
(c) Random image
Figure 4: Compression examples
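
A very rough way to see the difference in code (this only looks at the marginal per-pixel histogram, ignoring the spatial correlations that carry most of the redundancy, and the images below are synthetic stand-ins for the ones in Figure 4):

Code
import numpy as np

def pixel_entropy(img):
    """Marginal entropy (bits) of an image's pixel-value histogram."""
    counts = np.bincount(img.ravel(), minlength=256)
    probs = counts / counts.sum()
    probs = probs[probs > 0]              # drop empty bins to avoid log2(0)
    return -np.sum(probs * np.log2(probs))

h, w = 64, 64
rng = np.random.default_rng(0)
random_img = rng.integers(0, 256, size=(h, w, 3), dtype=np.uint8)  # "anything" picture
black_img = np.zeros((h, w, 3), dtype=np.uint8)                    # single possible value

print(pixel_entropy(random_img))  # close to 8 bits per value (the maximum for 8-bit data)
print(pixel_entropy(black_img))   # 0 bits: no surprise at all

A real dog picture would land somewhere in between, and its effective information content is even lower once pixel correlations are taken into account.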

Meaning

Noticing something counter-intuitive? Meaning is not the same thing as information. Meaning relates to our subjective understanding of a message, while information is an objective metric. Figure 4 (b) is way more meaningful to us than Figure 4 (c), even though it presents a lower information content.

As a matter of fact, absence of information can imply presence of meaning. As we will see below, most languages (both human and machine) need redundancy to ensure successful communication. It is plausible to think that extraterrestrial beings trying to communicate across the cosmos also have redundancy built into their messages. If we intercept some signal with lower-than-expected information content, we could be dealing with alien messages.

This is getting a bit meta, but observing an event X:= picking up a message with low information content, gives us a lot of information, as it is very unlikely.
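
Sticking with the meta thought, here is a toy version of that “alien detector” (the signals are completely made up, and this estimate only looks at 0/1 frequencies, not at longer-range structure):

Code
import numpy as np

def empirical_bit_entropy(bits):
    """Entropy (bits per symbol) of the 0/1 frequencies in a bit stream."""
    p1 = np.mean(bits)
    probs = np.array([p1, 1 - p1])
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

rng = np.random.default_rng(42)
noise = rng.integers(0, 2, size=10_000)   # background noise: fair coin flips
signal = np.tile([1, 1, 1, 0], 2_500)     # highly redundant, repetitive "message"

print(empirical_bit_entropy(noise))   # ~1.0 bit per bit: as expected from noise
print(empirical_bit_entropy(signal))  # ~0.81 bits per bit: suspiciously low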

6 The dog image distribution is extremely complex from an analytical point of view. However, current generative machine learning models are able to approximate its functional form, allowing us to sample from it and obtain images of dogs never seen before.

5 A cool thought experiment around this is the Canvas of Babel: a website sampling any 4-bit 3-channel 416x640 pixel image. That is around 10^961755 images (just for reference: the number of atoms in the observable universe is around 10^80, so you could assign around 10^961675 images to each atom…). If you are (incredibly) lucky, you can get a picture of the moment you were born, everything you will ever see (regardless of how spontaneous it feels), a picture showing how you will die (actually, your real death and also many fake ones), pictures containing the secret of immortality… Most of the time you’ll get gibberish though. More here.

This idea presented in Figure 4 is key for data compression: files with greater redundancy (less information) can be compressed more than files with less redundancy (more information).

Not only that, but it turns out that entropy is the theoretical limit of data compression: if we randomly select a string from a given probability distribution, then the best average compression ratio we can get for the string is given by the entropy rate of the probability distribution. Check #compression-example for a (very) simplified example of it.

A bit more philosophically: any model we develop of our environment can be thought of as a way to compress information. For instance, classical physics formulas encode things like a body’s trajectory given a few initial conditions (this way we don’t need to store its coordinates at every instant). Analogously, artificial neural networks such as language models compress and retain information from trillions of training tokens. They are able to do it by identifying redundancies and patterns in our languages and environment 7.

Simple example (because why not?)

Consider a language with 3 symbols: A, B, C that appear independently with different frequencies (probabilities):

$$P(A) = \tfrac{1}{2}, \quad P(B) = \tfrac{1}{4}, \quad P(C) = \tfrac{1}{4}$$

If we compute the entropy:

$$H(\text{String of this language}) = -\left(\tfrac{1}{2}\log_2\left(\tfrac{1}{2}\right) + \tfrac{1}{4}\log_2\left(\tfrac{1}{4}\right) + \tfrac{1}{4}\log_2\left(\tfrac{1}{4}\right)\right) = 1.5 \text{ bits per symbol}$$

Notice that when encoding into binary digits there is no notion of symbol boundaries, so we need a way to tell where each symbol starts and ends. A possible solution is to use a fixed-length encoding: for this example we need (at least) 2 bits per symbol to encode all three symbols.

However, we saw that the entropy is 1.5 bits per symbol, which, according to the compression-limit statement above, means that messages can be expressed with fewer bits on average. Huffman coding gives us the optimal possible lossless compression 8:

  • Encode A as 0 (1 bit)
  • Encode B as 10 (2 bits)
  • Encode C as 11 (2 bits)

Things to notice:

  1. There is no confusion about symbol boundaries: this is a prefix code (no codeword is a prefix of another), so a string of bits encoding these symbols can only be decoded in a single way.

  2. On average, if we sample a string from this distribution, 1/2 of the symbols will be A (1 bit) and 1/4 each will be B and C (2 bits each). So, using this encoding, we get an expected code length of 1/2⋅1 + 1/4⋅2 + 1/4⋅2 = 1.5 bits per symbol (the theoretical minimum).

  3. In practice, for a particular finite set of finite strings, we could achieve much higher compression. However, we are concerned about the average case.

This was a very simple example because I don’t want to get too deep into compression algorithms (maybe another day we can look into those 😉). I hope the idea was captured: More redundancy → Lower Entropy → More Compression.
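
A tiny sanity check of the numbers above, using the code from the bullet list (A → 0, B → 10, C → 11): the expected code length matches the 1.5 bits/symbol entropy, and decoding is unambiguous.

Code
import numpy as np

probs = {"A": 0.5, "B": 0.25, "C": 0.25}
code = {"A": "0", "B": "10", "C": "11"}  # prefix code: no codeword is a prefix of another

entropy = -sum(p * np.log2(p) for p in probs.values())
expected_length = sum(probs[s] * len(code[s]) for s in probs)
print(entropy, expected_length)          # 1.5 1.5

# Encode and decode a sample string: the bit stream can only be read one way
message = "ABACA"
encoded = "".join(code[s] for s in message)
inverse = {v: k for k, v in code.items()}
decoded, buffer = [], ""
for bit in encoded:
    buffer += bit
    if buffer in inverse:
        decoded.append(inverse[buffer])
        buffer = ""
print(encoded, "".join(decoded))         # 0100110 ABACA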

8 For this type of problem (independent symbols). Data of other nature (such as images or text in current languages) use different methods to compress information.

7 This video is an interesting podcast where Marcus Hutter discusses these ideas

Entropy in physics is defined as:

$$S := k_B \log_e(\Omega) \quad \left[\tfrac{J}{K}\right]$$

Where k_B is the Boltzmann constant (a positive constant) and Ω is the number of microstates compatible with the macroscopic state of the system. A microstate is a specific configuration of the position and velocity of every particle in the system that is consistent with the system’s overall properties: its macrostate. A macrostate encompasses system characteristics such as its energy, volume, number of particles… Thus, the more ways the particles can be arranged, the higher the entropy.
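
Just to put (made-up) numbers on it, a minimal sketch of Boltzmann’s formula for a macrostate with a single compatible microstate versus one compatible with astronomically many:

Code
import numpy as np

k_B = 1.380649e-23  # Boltzmann constant, in J/K

def boltzmann_entropy(n_microstates):
    """Thermodynamic entropy S = k_B * ln(Omega), in J/K."""
    return k_B * np.log(n_microstates)

print(boltzmann_entropy(1))           # 0.0 J/K: a single microstate, perfect order
print(boltzmann_entropy(2.0 ** 100))  # ~9.6e-22 J/K: many more ways to arrange things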

In terms of Figure 4 we have the following cases:

  • Figure 4 (a): Macrostate could be “all-black”, there is only a single microstate certifying this (0-entropy).

  • Figure 4 (b): Macrostate could be “dog-picture”. While there are many microstates certifying it, there are way fewer than the possibilities of the macrostate “anything-picture”.

As you can see, both definitions share a lot of similarities: both are a measure of “disorder”, “uncertainty”, or “amount of possibilities”. Also notice the parallels (macrostate - distribution, microstate - sample).

If thermodynamics’ laws happen to be true, an interesting consequence is that the overall entropy of an isolated system can never decrease. It can decrease locally, but that must be compensated by an equal or larger increase somewhere else. An intuitive way of looking at it is that there are many more ways for something to be “disordered” than there are for it to match our notion of “order”. Thus, it seems plausible to expect that systems tend to drift toward more “chaotic” states.

This hypothesis has interesting implications for our understanding of the origin and convergence of the universe as an isolated system:

  1. At the moment of the Big Bang the entropy was very low: Having all matter condensed together gives very little room for variation of possible states (low Ω).

  2. With the expansion of the universe things become less structured, making entropy increase. Random events (e.g. particle collisions in space or mutations in organisms) lead to systems that might be temporarily more or less stable. Some as complex as a 🌵 (which we label as “alive”) 9.

  3. With enough time, the universe should tend to a completely chaotic “soup” of particles so far apart from each other that they never interact. Its macroscopic state would be so vague that it could be realized by a huge number of microconfigurations (Ω, and hence entropy, tending to infinity).

In summary: We are going from something like Figure 4 (a) to stuff like Figure 4 (b) to things like Figure 4 (c). Anyway, we are having some fun in the meantime 🤪

Figure of a funny meme

9 I purposely chose quite a dumb living being for comedic effect. And no, I am not usually the soul of the party.

Caution

This post IS NOT FINISHED: The scope got a bit out of hand because of my inability to stay on topic.

Anyway, this is what I got so far 🤸‍♂️ Coming soon… More stuff around:

  • Cross-Entropy
  • Joint Entropy

  • Conditional Entropy

  • Mutual Information