Today we'll look into the concept of information from a probabilistic perspective 1. Hold on to your hat, because we will connect topics as random as:
your friend's life-changing trip to Thailand
picking up messages from aliens
how to see a picture of your own death 👻
the origin and convergence of the universe
Anyway, to start with a bit of context (pun intended): this paper by Claude Shannon laid the groundwork for the theoretical frameworks behind information transmission, data compression, cryptography, and core concepts in machine learning. Pas mal…
The plan is to introduce basic concepts of information theory (such as Shannon information content, entropy, cross-entropy…) and, from those, go off on tangents about whatever comes to mind 2.
2 This format is new to me, so let's see how it turns out. Usually I talk about more specific stuff.
Shannon information
Definition: the information gained by observing an event $x$ under a discrete 3 distribution $p$:
$$I(x) = -\log_2 p(x)$$
3 Why not continuous? Technically the probability of a continuous random variable taking a particular value is 0. Thus, this notion of Shannon information is not defined. There exist extensions to continuous variables, but I won't open that can of worms for now.
It encodes "how surprising" an observed event is. 4 The base unit is the bit (aka binary digit): 1 bit corresponds to the amount of information gained by observing the outcome of a $\mathrm{Bernoulli}(0.5)$ variable (i.e. the flip of a perfect coin).
4 Notice: if $p(x) = 1$, then $x$ is "0-surprising"; if $p(x) \to 0$, then $x$ is "∞-surprising".
But why the logarithm?
It is the only function (up to a multiplicative scalar) that meets Shannon's axioms:
Certainty: an event with probability 100% is "perfectly unsurprising" and yields zero information.
Monotonically decreasing: the less probable an event is, the more surprising it is, and the more information it yields.
Independence: if two independent variables are measured separately, the total information gained is the sum of the self-informations of the individual events (see the quick numeric check below).
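To make the axioms concrete, here is a quick numeric check (a minimal sketch using the $-\log_2$ definition above; the two events are just illustrative):

```python
import numpy as np

def info(p):
    """Shannon information (in bits) of an event with probability p."""
    return -np.log2(p)

# Two independent events: a fair coin landing heads and a fair die rolling a six.
p_heads, p_six = 0.5, 1 / 6

# Certainty: an event with probability 1 carries no information.
print(info(1.0))                          # 0 bits

# Monotonicity: the rarer event carries more information.
print(info(p_heads), info(p_six))         # 1.0 bits, ~2.585 bits

# Independence: the information of the joint event is the sum of the individual ones.
print(info(p_heads * p_six))              # ~3.585 bits
print(info(p_heads) + info(p_six))        # ~3.585 bits (same value)
```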
Code
```python
import numpy as np
import matplotlib.pyplot as plt

prob = np.arange(1e-10, 1., 1e-3)
info = -np.log2(prob)  # information content in bits

fig, ax = plt.subplots()
ax.plot(prob, info)
ax.set_xlabel("Probability of the event")
ax.set_ylabel("Information gained from observing the event (bits)")
ax.grid(True)
plt.show()
```
Figure 1
Rambling: Computer bits and quantum bits
What about the bits in my computer?
Same thing: you can think of reading a stream of bits (e.g. from your disk) as sampling from a distribution. Outcomes can either be 0 or 1, each with 50% probability (if we don't have any prior knowledge). Thus, each observed bit yields an information content of 1 bit:
$$I(0) = I(1) = -\log_2(0.5) = 1 \text{ bit}$$
For historical and practical reasons (in hardware and software design), the standard unit of data in computing became the byte: 1 byte = 8 bits. With a byte of information you can represent $2^8 = 256$ values. Common examples are characters in text (check ASCII) or the light intensity of a pixel in a gray-scale image.
We will get back to all this when we talk about data compression in telecommunications.
What about quantum information?
The basic unit of quantum information is called the qubit (aka quantum bit): a two-state quantum-mechanical system, which can be physically realized by different particles, for instance photons, electrons or ions.
When measuring the state of a qubit, it collapses to either of 2 basis states, $|0\rangle$ or $|1\rangle$, which can be encoded as 0 and 1. Interestingly, after observation its internal state gets disturbed, losing some of the characteristics that make it so powerful. Before measurement, however, a qubit presents several properties not captured by classical physics:
Superposition: qubits exist in a state of being both 0 and 1 simultaneously with different degrees (probabilities). This allows quantum computers to process a vast number of possibilities simultaneously, yielding a huge speedup for certain problems (see the toy simulation after this list).
Entanglement: qubits can become entangled, meaning the state of one is tied to the state of another, regardless of the distance between them. This property enables the creation of highly correlated states that can be used to perform some computations more efficiently than classical computers can.
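As a toy illustration (a classical simulation, not real quantum hardware; the amplitudes below are made up for the example), this is what repeatedly measuring an equal-superposition qubit looks like from the probabilistic point of view:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical qubit state |psi> = alpha|0> + beta|1>, with |alpha|^2 + |beta|^2 = 1.
alpha, beta = 1 / np.sqrt(2), 1 / np.sqrt(2)   # equal superposition

# Measuring in the computational basis collapses the state to 0 or 1
# with probabilities |alpha|^2 and |beta|^2.
probs = np.array([abs(alpha) ** 2, abs(beta) ** 2])
samples = rng.choice([0, 1], size=10_000, p=probs)

print(probs)                                 # [0.5 0.5]
print(np.bincount(samples) / len(samples))   # empirical frequencies, close to [0.5 0.5]

# Each such measurement yields -log2(0.5) = 1 bit of information,
# just like the flip of a fair classical coin.
print(-np.log2(probs[0]))
```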
Examples
Coin
As established, observing the toss of a fair coin gives you 1 bit of information:
$$I(\text{heads}) = -\log_2\left(\tfrac{1}{2}\right) = 1 \text{ bit}$$
Dice
Observing the outcome of a die roll carries more information, because each possibility has a lower probability. Observing an outcome of a fair six-sided die gives you:
$$I(x) = -\log_2\left(\tfrac{1}{6}\right) \approx 2.58 \text{ bits}$$
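For reference, both numbers can be checked in a couple of lines (a trivial sketch):

```python
import numpy as np

print(-np.log2(1 / 2))   # fair coin flip: 1.0 bit
print(-np.log2(1 / 6))   # fair six-sided die: ~2.585 bits
```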
Metro
Consider a station where a metro passes every 5 minutes. If you arrive at the station at a random moment, the wait time $T$ is described by a uniform distribution $T \sim \mathcal{U}(0, 5)$.
This means that the probability of having a wait time of less than $t$ minutes is given by:
$$P(T < t) = \frac{t}{5}, \quad 0 \le t \le 5$$
Thus, the "amount of surprise" we get from learning that the wait time is lower than $t$ minutes ($T < t$) is:
$$I(T < t) = -\log_2\left(\frac{t}{5}\right)$$
So we are 0-surprised to learn that we have to wait less than 5 minutes (the metro passes for sure within that time frame, so all waiting times are below it). However, since it is quite lucky to arrive just when the train is about to come, learning that the wait is below a small $t$ makes us very surprised (very few waiting times fall below it). If we plot it:
Code
```python
b = 5  # a metro passes every 5 minutes

# Range of wait-time values (in minutes); start slightly above 0 to avoid log(0)
t = np.linspace(0.01, 5, 1000)
wait_less_than = t / b                   # P(T < t) for T ~ U(0, 5)
information = -np.log2(wait_less_than)   # Shannon information in bits

# Plot the probability and the information content side by side
fig, axs = plt.subplots(1, 2, figsize=(7, 4))

axs[0].plot(t, wait_less_than, label='Wait probability', color='green')
axs[0].set_title('Wait time less than T prob.')
axs[0].set_xlabel('Time (mins)')
axs[0].set_ylabel('Probability')
axs[0].grid(True)
axs[0].legend()

axs[1].plot(t, information, label='Shannon Information', color='red')
axs[1].set_title('Information Content')
axs[1].set_xlabel('Time (mins)')
axs[1].set_ylabel('Information (bits)')
axs[1].grid(True)
axs[1].legend()

plt.tight_layout()
plt.show()
```
Figure 2
Note: we converted a continuous problem into a discrete one by saying "to wait less than…". We apply the same trick in the following example.
Thailand
Consider that friend who mentions their (life-changing) trip to Thailand an average of 3 times per hour of conversation. At any random moment of a conversation, the time $T$ until the next Thailand mention follows an exponential distribution: $T \sim \mathrm{Exp}(\lambda = 3 \text{ per hour})$.
This means that the probability of the friend NOT mentioning Thailand in the next $t$ hours ($T > t$) is:
$$P(T > t) = e^{-\lambda t} = e^{-3t}$$
Thus, the amount of surprise we get from them NOT mentioning Thailand for $t$ hours is:
$$I(T > t) = -\log_2\left(e^{-3t}\right) = 3t\log_2(e) \approx 4.33\,t \text{ bits}$$
Therefore, your "amount of surprise" grows linearly with every minute your friend doesn't say something related to Thailand. If we plot it:
Code
```python
# Rate parameter for the exponential distribution: 3 mentions per hour
lambda_param = 3 / 1  # mentions per hour

# Range of time values (in hours): from 0 to 2 hours
time_values_hours = np.linspace(0, 2, 1000)
not_mention = np.exp(-lambda_param * time_values_hours)   # P(T > t)
information_values_hours = -np.log2(not_mention)          # Shannon information in bits

# Plot the probability and the information content side by side
fig, axs = plt.subplots(1, 2, figsize=(7, 4))

axs[0].plot(time_values_hours, not_mention, label='1 - CDF: λ = 3 per hour', color='green')
axs[0].set_title('Not mentioning Thailand prob.')
axs[0].set_xlabel('Time (hours)')
axs[0].set_ylabel('Probability')
axs[0].grid(True)
axs[0].legend()

axs[1].plot(time_values_hours, information_values_hours, label='Shannon Information', color='red')
axs[1].set_title('Information Content')
axs[1].set_xlabel('Time (hours)')
axs[1].set_ylabel('Information (bits)')
axs[1].grid(True)
axs[1].legend()

plt.tight_layout()
plt.show()
```
Figure 3
Entropy
Definition: the entropy measures the (weighted) average information content of a probability distribution $p$:
$$H(p) = \mathbb{E}_{x \sim p}[I(x)]$$
It answers how surprised you expect to be when sampling from $p$.
High entropy → more uncertainty: many events are similarly likely. More chaos.
Low entropy → more deterministic: a few events are very likely. More order.
If we expand the expectation:
$$H(p) = \sum_x p(x)\,I(x) = -\sum_x p(x)\log_2 p(x)$$
Examples
Coin vs Biased Coin
Given a perfect coin we have an entropy of 1 bit:
$$H = -\left(0.5\log_2 0.5 + 0.5\log_2 0.5\right) = 1 \text{ bit}$$
However, suppose we know that one side of the coin is more likely than the other… Let's say that for some outlandish reason the probability of heads is 90% and tails 10%:
$$H = -\left(0.9\log_2 0.9 + 0.1\log_2 0.1\right) \approx 0.47 \text{ bits}$$
Overall, we have a lower "average surprise". If we always predict heads, we'll get most of the answers right. Here the observations are "less chaotic" or "more deterministic".
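A small helper reproduces these numbers (just a direct sketch of the formula above):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution given as probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # convention: 0 * log(0) = 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: ~0.47 bits
```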
Decision trees
Decision tree algorithms use this concept directly: at each node they choose the split that yields the biggest information gain, i.e. the largest expected reduction in the entropy of the labels.
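A minimal sketch of that idea (the toy labels and splits below are made up for illustration):

```python
import numpy as np

def entropy(labels):
    """Entropy (in bits) of the empirical label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, mask):
    """Entropy reduction obtained by splitting `labels` according to a boolean `mask`."""
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted_children

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
good_split = np.array([True, True, True, True, False, False, False, False])
bad_split = np.array([True, False, True, False, True, False, True, False])

print(information_gain(y, good_split))  # 1.0 bit: the split separates the classes perfectly
print(information_gain(y, bad_split))   # 0.0 bits: the split tells us nothing about the class
```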
Rambling: Entropy and famous probability distributions
Most well-known probability distributions maximize entropy given some constraints:
| Distribution | Conditions for Maximizing Entropy |
|--------------|-----------------------------------|
| Uniform      | For a given real interval. |
| Gaussian     | For a given mean and variance over the real numbers. |
| Exponential  | For a given mean, over positive real values. |
| Poisson      | For a given mean number of events in a fixed interval. |
| Binomial     | For a given number of trials and success probability. |
| Geometric    | For a given success probability, counting trials until the first success. |
| Beta         | For variables in [0, 1] with a given mean and variance. |
Intuitively, it makes sense to assume maximum entropy when modelling stuff: you go for the most uncertain case compatible with what you know.
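A quick empirical check of the first row of the table (restricted to a finite support of 6 outcomes, with the alternative distributions drawn at random from the simplex):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

uniform = np.full(6, 1 / 6)
random_dists = rng.dirichlet(np.ones(6), size=10_000)  # random distributions over 6 outcomes

print(entropy(uniform))                        # log2(6) ~ 2.585 bits
print(max(entropy(p) for p in random_dists))   # always (slightly) below the uniform's entropy
```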
Rambling: Information vs meaning
Consider the images in Figure 4. While they are all represented by the same number of bits, the amount of information varies:
Random image: a sample from the uniform categorical distribution over $\{0, \dots, 255\}$ (for each pixel and channel). As seen, this distribution presents the maximum possible entropy in the space of 8-bit data. 5
Good boy image: a sample from the dog image distribution. Whatever this probability distribution looks like, its entropy is going to be lower: pixels in this space present a high level of correlation (thus lower information). This greatly reduces the degrees of freedom, or "room for unpredictability". From a probability perspective, some pixel values have higher probability than others, making the average amount of surprise lower 6 (a crude numeric version of this comparison is sketched below).
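A crude way to see the gap numerically (a sketch only: we use a synthetic smooth gradient as a stand-in for a structured image like the dog picture, and the marginal pixel-value histogram as a rough proxy for entropy, since it ignores spatial correlations):

```python
import numpy as np

rng = np.random.default_rng(0)

def histogram_entropy(img):
    """Entropy (in bits) of the marginal pixel-value histogram of an 8-bit image."""
    counts = np.bincount(img.ravel(), minlength=256)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

noise = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)      # "random image"
gradient = np.tile(np.arange(256, dtype=np.uint8) // 8, (256, 1))  # smooth, highly redundant image

print(histogram_entropy(noise))     # close to 8 bits per pixel (the maximum for 8-bit values)
print(histogram_entropy(gradient))  # 5 bits per pixel: only 32 distinct values appear
```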
Noticing something counter-intuitive? Meaning is not the same thing as information. Meaning relates to our subjective understanding of a message, while information is an objective metric. Figure 4 (b) is way more meaningful to us than Figure 4 (c), even though it presents a lower information content.
As a matter of fact: absence of information can imply presence of meaning. As we saw, most languages (both human and machine) need redundancy to ensure successful communication. It is plausible to think that extraterrestrial beings trying to communicate across the cosmos also have redundancy built into their messages. If we intercept some signal with lower-than-expected information content, we could be dealing with alien messages.
This is getting a bit meta, but the event of picking up a message with low information content itself gives us a lot of information, as it is very unlikely.
6 The dog image distribution is extremely complex from an analytical point of view. However, current generative machine learning models are able to approximate its functional form, allowing us to sample from it and obtain images of dogs never seen before.
5 A cool thought experiment around this is the Canvas of Babel: a website sampling any 4-bit, 3-channel, 416x640-pixel image. That is around $10^{961{,}755}$ possible images (just for reference: the number of atoms in the observable universe is around $10^{80}$, so you can assign around $10^{961{,}675}$ images to each atom…). If you are (incredibly) lucky, you can get a picture of the moment you were born, everything you will ever see (regardless of how spontaneous it feels), a picture showing how you will die (actually, your real death and also many fake ones), pictures containing the secret of immortality… Most of the time you'll get gibberish though. More here.
Rambling: Entropy and data compression
This idea, presented in Figure 4, is key for data compression: files with greater redundancy (less information) can be compressed more than files with less redundancy (more information).
Not only that: it turns out that entropy is the theoretical limit of data compression. If we randomly select a string from a given probability distribution, then the best average compression we can achieve for the string is given by the entropy rate of that distribution. Check #compression-example for a (very) simplified example of it.
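A quick way to see the trend with an off-the-shelf compressor (zlib here, which is not an entropy-optimal coder, but the redundant vs. random contrast is clear):

```python
import os
import zlib

redundant = b"ABAB" * 25_000        # 100 kB of highly redundant data (low entropy)
random_bytes = os.urandom(100_000)  # 100 kB of (pseudo-)random data (close to maximum entropy)

print(len(zlib.compress(redundant)))     # a few hundred bytes
print(len(zlib.compress(random_bytes)))  # ~100 kB: essentially incompressible
```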
A bit more philosophically: any model we develop of our environment can be thought of as a way to compress information. For instance, classical physics formulas encode things like the trajectories of bodies given some initial conditions (this way we don't need to store the information for every coordinate). Analogously, artificial neural networks such as language models compress and retain information from trillions of training tokens. They are able to do it by identifying redundancies and patterns in our languages and environment 7.
Simple example (because why not?)
Consider a language with 3 symbols, A, B and C, that appear independently with different frequencies (probabilities): $P(A) = 0.5$, $P(B) = 0.25$, $P(C) = 0.25$.
If we compute the entropy:
$$H = -\left(0.5\log_2 0.5 + 0.25\log_2 0.25 + 0.25\log_2 0.25\right) = 1.5 \text{ bits per symbol}$$
Notice that when encoding into binary digits there is no notion of symbol boundaries, so we need a way to mark where each symbol starts and ends. A possible solution is to use a fixed-length encoding: for this example we need (at least) 2 bits per symbol to encode all three symbols.
However, we saw that the entropy is 1.5 bits per symbol, which, according to the compression limit statement, means that messages can be expressed with fewer bits on average. Huffman coding gives us the optimal possible lossless compression 8:
Encode A as 0 (1 bit)
Encode B as 10 (2 bits)
Encode C as 11 (2 bits)
Things to notice:
There is no confusion about symbol boundaries: the code is prefix-free, so a string of bits built from these codewords can only be decoded in a single way.
On average, if we sample a string from this distribution, half of the symbols will be A (1 bit), and a quarter each will be B and C (2 bits each). So, using this encoding we get an expected length per symbol of $0.5 \cdot 1 + 0.25 \cdot 2 + 0.25 \cdot 2 = 1.5$ bits (the theoretical minimum). Both claims are checked in the short snippet below.
In practice, for a particular finite set of finite strings, we could achieve much higher compression. However, we are concerned about the average case.
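A tiny sanity check of both claims (a sketch, with the probabilities and codewords from above):

```python
import numpy as np

rng = np.random.default_rng(0)

symbols = ["A", "B", "C"]
probs = [0.5, 0.25, 0.25]
code = {"A": "0", "B": "10", "C": "11"}   # the prefix code from above

# Expected code length matches the entropy: 1.5 bits per symbol.
expected_length = sum(p * len(code[s]) for s, p in zip(symbols, probs))
entropy = -sum(p * np.log2(p) for p in probs)
print(expected_length, entropy)

# Prefix-free decoding: read bits left to right, emit a symbol as soon as a codeword matches.
def decode(bits):
    inverse = {v: k for k, v in code.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            decoded.append(inverse[buffer])
            buffer = ""
    return "".join(decoded)

message = "".join(rng.choice(symbols, p=probs, size=20))
encoded = "".join(code[s] for s in message)
print(message, encoded, decode(encoded), sep="\n")
assert decode(encoded) == message   # unambiguous: we recover exactly the original string
```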
This was a very simple example because I don't want to get too deep into compression algorithms (maybe another day we can look into those). I hope the idea was captured: more redundancy → lower entropy → more compression.
8 For this type of problem (independent symbols). Data of a different nature (such as images or text in natural languages) uses different methods to compress information.
7 This video is an interesting podcast where Marcus Hutter discusses these ideas.
Rambling: Entropy and physics (thermodynamics)
Entropy in physics is defined as:
$$S = k \ln W$$
Where $k$ is some positive constant (the Boltzmann constant) and $W$ is the number of microstates compatible with the macroscopic state of the system. A microstate is a specific configuration of the location and velocity of every particle in the system that is consistent with the system's overall properties: its macrostate. A macrostate encompasses system characteristics such as its energy, volume, number of particles… Thus, the more possible ways there are for the particles to be arranged, the higher the entropy.
Figure 4 (a): the macrostate could be "all-black"; there is only a single microstate realizing it ($W = 1$, so zero entropy).
Figure 4 (b): the macrostate could be "dog-picture". While there are many microstates realizing it, there are way fewer than for the macrostate "anything-picture".
As you can see, both definitions share a lot of similarities: both are a measure of "disorder", "uncertainty", or "number of possibilities". Also notice the parallels (macrostate ↔ distribution, microstate ↔ sample).
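To make the parallel concrete (a toy check, not a physical computation): if a macrostate is compatible with $W$ equally likely microstates, the Shannon entropy of the distribution over microstates is $\log_2 W$ bits, mirroring $S = k \ln W$ up to the constant and the base of the logarithm.

```python
import numpy as np

# Shannon entropy of a uniform distribution over W microstates equals log2(W).
for W in [2, 64, 1024]:
    p = np.full(W, 1 / W)
    print(W, -np.sum(p * np.log2(p)), np.log2(W))

# The "all-black" macrostate of Figure 4 (a) has W = 1, hence log2(1) = 0 bits (zero entropy).
```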
If the laws of thermodynamics hold, an interesting consequence is that the overall entropy of an isolated system can never decrease. It can decrease locally, but that decrease must be compensated by an equal or larger increase somewhere else. An intuitive way of looking at it is that there are many more ways for something to be "disordered" than there are for it to match our notion of "order". Thus, it seems plausible to expect that systems tend to drift toward more "chaotic" states.
This hypothesis has interesting implications for our understanding of the origin and convergence of the universe as an isolated system:
At the moment of the Big Bang the entropy was very low: having all matter condensed together leaves very little room for variation in the possible states (low $W$).
With the expansion of the universe things become less structured, making entropy increase. Random events (e.g. particle collisions in space or mutations in organisms) lead to systems that might be temporarily more or less stable. Some are as complex as a 🐔 (which we label as "alive") 9.
With enough time, the universe should tend to a completely chaotic "soup" of particles so far apart from each other that they never interact. Its macroscopic state is so vague that it can be realized by a huge number of microconfigurations ($W$, and hence the entropy, tending to infinity).
In summary: we are going from something like Figure 4 (a) to stuff like Figure 4 (b) to things like Figure 4 (c). Anyway, we are having some fun in the meantime 🤪
Figure of a funny meme
9 I purposely chose quite a dumb living being for comedic effect. And no, I am not usually the life of the party.
Caution
This post IS NOT FINISHED: the scope got a bit out of hand because of my inability to stay on topic.
Anyway, this is what I got so far 🤸 Coming soon… More stuff around: