The wonderful world of positional encoding

I give an intuitive understanding of sinusoidal positional encodings, RoPE, and ALiBi. Finally, I assess their contribution to the learning process compared to using no encoding or learned encodings.
Transformers explained
Author

Oleguer Canal

Published

July 1, 2024

“Interestingly”, the standard dot-product attention mechanism1 is unaware of elements’ positions within a sequence2: outputs are a weighted average of inputs according to their inter-similarity. Unless we somehow encode positional information, we could forward the input sequence in a permuted order and the output would be the same. Order, however, is a key piece of information for sequential data3.

1 The heart 💙 of the transformer architecture

2 That is, assuming no causal masking is applied to the attention matrix. We’ll talk about this in Section 5.

3 An unordered sentence, video or audio makes no sense!

Today, we look at the most influential ways of encoding positional information into the transformer architecture:

Before we start, let’s review a couple of obvious approaches that might seem to do the trick well enough, but don’t work well in practice. This is a nice exercise to surface some desirable properties of positional encodings.

At a first glance, “appending positional information” seems like a trivial task. One might ask:

🤔 How about appending each element’s sequence index to the input as an extra feature?

  • Large numbers: The sequence length can be very long, making these indexes very large. This value disparity with the other inputs can make learning harder: one usually wants more-or-less normalized inputs.
  • Harder generalization: At inference time, the model might see sequences with different lengths (larger index values) than those in the training set, making predictions unstable.

🤔 How about instead of appending the index, linearly splitting the 0-1 interval and appending each corresponding value to each input?

  • The main issue with this approach is time-step inconsistency across different input lengths: an input of 10 elements has a step size of \(0.1\), while an input of 20 elements has \(0.05\).
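The inconsistency above is easy to see in a few lines of code. This is a minimal sketch of one such naive normalization scheme (`index / length`, an assumption matching the step sizes quoted above):

```python
import numpy as np

def normalized_positions(length):
    # Naive scheme: linearly split the 0-1 interval across the sequence,
    # appending index / length to each input so values stay in [0, 1).
    return np.arange(length) / length

# The step between adjacent positions depends on the sequence length,
# so "one token apart" means different things for different inputs.
step_10 = normalized_positions(10)[1] - normalized_positions(10)[0]  # 0.1
step_20 = normalized_positions(20)[1] - normalized_positions(20)[0]  # 0.05
```

The same physical offset (one token) thus gets encoded differently depending on how long the sequence happens to be.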

Ok, so it seems it won’t be so easy-peasy lemon-squeezy, let’s see how the pros do it!

Learned embeddings

This is the most “lazy” approach: let the network figure it out.

We simply learn a matrix of embeddings, where every row represents a position in the sequence, and add those vectors to the corresponding token embeddings.

Models such as BERT (and most of its variants) and all of OpenAI’s GPTs (at least until InstructGPT) use this method. Its main advantages are its simplicity, the low architectural bias it introduces, and being end-to-end learnable. For short sequences it is reported to perform similarly to sinusoidal embeddings.

However, it presents some clear disadvantages:

  • Sequence length generalization is an issue: there is no length extrapolation.

  • It is more expensive at train time than pre-defined methods.

  • There is no built-in relative position bias (as sinusoidal, rotary and ALiBi have).

Sinusoidal positional encodings

This was the approach taken by the original transformer architecture (2017). Given a list of \(T\) tokens: First, they project each token to its corresponding embedding, resulting in a matrix of shape \(T \times d\).4 Later, they simply sum a “temporal information” matrix (of shape \(T \times d\)) on top of the input embeddings. This happens at the very beginning of the transformer architecture:

4 \(d\) being the dimensionality of the embedding.

Where the positional matrix is summed with the input embeddings (red circles)

After the addition, each token embedding contains information on which position it holds in the sequence.

Intuitively explained

Each row of the matrix can be seen as a “timestamp”, a bit similar to what a number of \(d\) digits would look like (one digit for each coordinate). However, instead of using base-10 digits5, the cycles are given by \(\sin\) and \(\cos\) functions. In particular, contiguous coordinates of the row are paired: the first element of each pair is the sine, and the second the cosine, of a function with a given frequency. Earlier coordinates have higher frequencies than later ones.

5 For reasons similar to what I explained in Tip 1.

Representation of what a given sinusoids embedding might look like.

I give further interpretability of this in Tip 2 and explain advantages of this design choice in Section 2.2.1. But before that, let’s first formalize everything.

Mathematical definition

Let’s see this more formally. Each row of this matrix is a vector defined by:

\[ \begin{align} p_t^{(i)} := \begin{cases} \sin({\omega_k} \cdot t), & \text{if}\ i = 2k \\ \cos({\omega_k} \cdot t), & \text{if}\ i = 2k + 1 \end{cases} \end{align} \]

Where:

\[ \omega_k = \frac{1}{10000^{2k / d}} \]

They call \(k\) the “offset” in the paper (referring to the embedding coordinates), and I think the \(10000\) is quite arbitrary: any sufficiently large number should work.

What does this matrix actually look like?

This \(T \times d\) matrix looks like this:

Sinusoidal positional embeddings for a sequence of 50 elements (rows) and a 16-dimensional embedding (columns). Notice the high frequency in the first two components (k=0), and how each subsequent component has a lower frequency.
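A matrix like the one in the figure can be generated directly from the definition above. Here is a minimal NumPy sketch (the function name is my own):

```python
import numpy as np

def sinusoidal_positional_encoding(T, d):
    """Build the T x d matrix from the formulas above:
    p_t[2k] = sin(w_k * t), p_t[2k+1] = cos(w_k * t), w_k = 1 / 10000^(2k/d)."""
    P = np.zeros((T, d))
    t = np.arange(T)[:, None]             # positions, shape (T, 1)
    k = np.arange(d // 2)[None, :]        # offsets,   shape (1, d/2)
    omega = 1.0 / 10000 ** (2 * k / d)    # one frequency per offset
    P[:, 0::2] = np.sin(omega * t)        # even coordinates: sines
    P[:, 1::2] = np.cos(omega * t)        # odd coordinates:  cosines
    return P

P = sinusoidal_positional_encoding(50, 16)  # the matrix from the figure
```

The first row (position \(t=0\)) alternates \(0, 1, 0, 1, \dots\), since \(\sin(0)=0\) and \(\cos(0)=1\).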

There are many ways of thinking about this, here I put a couple I like.

The clocks analogy

Given a \(d\)-dimensional positional encoding, take the vector at row \(t\):

\[ \left[ \sin \left( \frac{t}{1} \right), \cos \left( \frac{t}{1} \right), \sin \left( \frac{t}{10000^{\frac{2}{d}}} \right), \cos \left( \frac{t}{10000^{\frac{2}{d}}} \right), \sin \left( \frac{t}{10000^{\frac{4}{d}}} \right), \cos \left( \frac{t}{10000^{\frac{4}{d}}} \right), \dots, \sin \left( \frac{t}{10000^1} \right), \cos \left( \frac{t}{10000^1} \right), \right] \]

As we move right along the vector, each component evolves slower: the denominator grows, which makes the frequency lower. I like to pack this array in subsequent pairs of coordinates, and imagine it as an array of \(\frac{d}{2}\) clocks6:

\[ \left[ \underbrace{\sin \left( \frac{t}{1} \right), \cos \left( \frac{t}{1} \right)}_{🕐}, \underbrace{\sin \left( \frac{t}{10000^{\frac{2}{d}}} \right), \cos \left( \frac{t}{10000^{\frac{2}{d}}} \right)}_{🕑}, \underbrace{\sin \left( \frac{t}{10000^{\frac{4}{d}}} \right), \cos \left( \frac{t}{10000^{\frac{4}{d}}} \right)}_{🕒}, \dots, \underbrace{\sin \left( \frac{t}{10000^1} \right), \cos \left( \frac{t}{10000^1} \right)}_{🕧}, \right] \]

As we move through time (advance down the rows of the matrix): The first clocks move very fast (like seconds), and then each one moves progressively slower (minutes, hours, …) as we go to the right. Notice that now the “offset” value \(k\) is analogous to the clock “index” in this vector:

\[ \left[ 🕐 \space \space, 🕑 \space \space, 🕒 \space \space, \dots, 🕧 \space \space \right] \]

This is exactly the same as what we are doing with the definition of the positional embedding! But since clocks are not numbers7, we store the \(y\) and \(x\) coordinates of each clock’s hand: a way to uniquely identify each clock position. These \(y\) and \(x\) coordinates are the \(\sin\) and \(\cos\) of the hand’s angle 🤯

Binary encoding analogy

On a different note, we can also see a similar pattern when writing numbers in binary.

Consider 4-bit binary numbers:

\[ \begin{align} &0000_2 = 0_{10} \\ &0001_2 = 1_{10} \\ &0010_2 = 2_{10} \\ &0011_2 = 3_{10} \\ &0100_2 = 4_{10} \\ &0101_2 = 5_{10} \\ &0110_2 = 6_{10} \\ &0111_2 = 7_{10} \\ &1000_2 = 8_{10} \\ &1001_2 = 9_{10} \\ \end{align} \]

The least significant bit (rightmost) changes every step, the next one every 2 steps, then every 4, then every 8, etc. In essence, we have the same pattern as before, where one coordinate changes very frequently, and the next one less so, and so on.

This is how a binary positional-embedding matrix would look, also for 50 elements and 16 dimensions.
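Such a binary matrix is trivial to construct with bit shifts. A small sketch, assuming the most significant bit goes in the first column (as in the listing above, where the rightmost bit flips fastest):

```python
import numpy as np

def binary_positional_encoding(T, d):
    """Each row is position t written as a d-bit binary number,
    most significant bit first, least significant (fastest) bit last."""
    positions = np.arange(T)[:, None]              # shape (T, 1)
    bit_index = np.arange(d - 1, -1, -1)[None, :]  # d-1, ..., 1, 0
    return (positions >> bit_index) & 1

B = binary_positional_encoding(10, 4)
# row 5 is 0101 in binary: [0, 1, 0, 1]
```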

🤔 Why don’t we do this for our positional encodings?

In summary, binary embeddings don’t have some of the nice properties of sinusoidal embeddings (which we’ll see shortly). In addition, there is no big reason to use something as discrete as 0’s and 1’s to encode our positional information when everything else is floating-point. In machine learning, we usually prefer to work with smooth, continuous, and differentiable functions, to make things easier for the optimizer.

6 Imagine the clocks only have one hand.

7 Because they are clocks, which is a totally different thing.

8 Unless working with ridiculously long sequences.

Sinusoidal positional encodings present some very nice properties: uniqueness8, boundedness, absolute-position-awareness, relative-position-awareness…

The first three are quite obvious, so I’d like to expand more on the relative-position-awareness property: there exists a time-independent linear transformation mapping the embedding from time \(t\) to \(t+n\).

Relative position awareness

Given the offset \(k\) (with coordinates \(2k\) and \(2k + 1\)) of the positional encoding at token positions \(n\) and \(m\) (times), there exists an absolute-time-independent matrix \(\mathbf{M}_k\) such that:

\[ \mathbf{M}_k (m - n) \cdot \begin{bmatrix} \sin(\omega_k \cdot n) \\ \cos(\omega_k \cdot n) \end{bmatrix} = \begin{bmatrix} \sin(\omega_k \cdot m) \\ \cos(\omega_k \cdot m) \end{bmatrix} \]

In particular the matrix \(\mathbf{M}_k\) is:

\[ \mathbf{M}_k (m - n) = \begin{bmatrix} \cos\bigl(\omega_k (m - n)\bigr) & \sin\bigl(\omega_k (m - n)\bigr) \\ -\sin\bigl(\omega_k (m - n)\bigr) & \cos\bigl(\omega_k (m - n)\bigr) \end{bmatrix}. \]

Interpretation of this linear mapping between two different time-steps (\(n\) and \(m\)), for a given offset \(k\) (in this case \(k=1\)). Of course, this happens for any sin-cos pair (any \(k\)), but with a different matrix \(\mathbf{M}_k\) for each pair.
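We can check this property numerically. A quick sketch with NumPy, using the \(\mathbf{M}_k\) defined above (the concrete offset and dimensions are arbitrary choices of mine):

```python
import numpy as np

def M(omega, delta):
    """The time-independent map from the formula above:
    it depends only on the relative distance delta = m - n."""
    c, s = np.cos(omega * delta), np.sin(omega * delta)
    return np.array([[c, s],
                     [-s, c]])

omega = 1.0 / 10000 ** (2 * 3 / 16)   # offset k = 3 of a 16-dim embedding
n, m = 50, 60
p_n = np.array([np.sin(omega * n), np.cos(omega * n)])
p_m = np.array([np.sin(omega * m), np.cos(omega * m)])

# M depends only on m - n, so the SAME matrix maps 50 -> 60 and 1000 -> 1010.
assert np.allclose(M(omega, m - n) @ p_n, p_m)
p_1000 = np.array([np.sin(omega * 1000), np.cos(omega * 1000)])
p_1010 = np.array([np.sin(omega * 1010), np.cos(omega * 1010)])
assert np.allclose(M(omega, 10) @ p_1000, p_1010)
```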

In linear algebra, we refer to a matrix of type:

\[ \begin{bmatrix} \cos(\phi) & -\sin(\phi)\\ \sin(\phi) & \cos(\phi)\\ \end{bmatrix} \]

as a rotation matrix. It performs a rotation in Euclidean space: it is a transformation which rotates the \(xy\) plane counterclockwise through an angle \(\phi\) around the origin.

In this case, however, we have a matrix which looks like:

\[ \begin{bmatrix} \cos(\phi) & \sin(\phi)\\ -\sin(\phi) & \cos(\phi)\\ \end{bmatrix} \]

It is still a rotation matrix, but it applies the rotation clockwise instead. This is easy to see since \(\sin(-x) = -\sin(x)\) and \(\cos(-x) = \cos(x)\).

As the authors put it: “We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset \(k\), \(P(m)\) can be represented as a linear function of \(P(n)\).”

The hypothesis is that this linearity makes it easier for the model to understand relative distances between positions: the same function maps the difference between positions \(50\) and \(60\) as between positions \(1000\) and \(1010\).

This all looks very cool, but something that always bothered me was:

🤔 Why summing this matrix to the input embedding? Wouldn’t it be less destructive to just concatenate it?

It’d be less destructive, but we’d need double the embedding dimension, wasting a lot of memory in the process. On top of that, this addition is taken into account in the training process: most likely, the model will not store as much information in the first dimensions as it does in the later ones, given that the first dimensions’ values oscillate much faster with input position.

RoPE: Rotary Position Embedding

While sinusoidal positional encodings work well, they present some limitations when it comes to implementing more advanced/efficient versions of the transformer9. The paper RoFormer addresses them by introducing RoPE.

9 For instance: they aren’t as meaningful when compressing subsequences in a single context, or when breaking up sequences across contexts, or when using kernelized variants of the attn mechanism.

RoPE happens directly at the attention mechanism instead of the positional embedding at the beginning.

RoPE is applied in the attention operations (red circles). No positional encoding on the inputs is applied.

Intuitively explained

RoPE’s core idea is actually very close to the sinusoids’. However, instead of adding a matrix of “rotations” as before, we directly modify the query and key embeddings, right before the dot-product operation. Given a query embedding10, they also do the split-in-pairs thingy of the sinusoidal vectors, treat each pair of elements as coordinates in the 2D plane, and apply a rotation to these coordinates.

10 They do exactly the same for key embeddings; we just focus on queries for readability.

The angle of the rotation is given by both: the element’s position in the sequence, and the coordinate offset \(k\) (same as before).

Steps followed by the RoPE function for a given query embedding at position \(n\) and coordinate offset \(k\). The process is exactly the same for keys.

Mathematical definition

When dealing with 2D stuff, it is often helpful to express things in terms of complex numbers \(\mathbb{C}\). Mainly because there are a lot of simple formulas that ease computation11. In Tip 5 I explain the connection between both.

11 Of course, we could do everything in \(\mathbb{R}^2\), but it’d be a bit more tedious.

Expressing complex numbers

Remember that a complex number \(z \in \mathbb{C}\) can be expressed by its real and imaginary parts:

\[ z = a + i b \]

Where \(i = \sqrt{-1}\) is the imaginary unit and \(a = \text{Re}(z)\) is the real part, and \(b = \text{Im}(z)\) is the imaginary part of \(z\).

Similarly, we can express it by its modulus \(m\) and angle \(\theta\) (aka phase):

\[ z = m \cdot e^{i \cdot \theta} \]

We can find the equivalence of both expressions through basic trigonometry:

\[ \begin{split} a &= m \cos \theta\\ b &= m \sin \theta\\ \end{split} \]

Or:

\[ \begin{split} m &= \sqrt{a^2 + b^2}\\ \theta &= \arctan{\frac{b}{a}}\\ \end{split} \]

Applying rotations

Imagine we want to “rotate” a complex value \(z = a + i \cdot b = m \cdot e^{i \cdot \theta}\) by an angle of \(\phi\). We basically want a new value \(z^\prime\) such that its phase is: \(\theta + \phi\).

Here it is very nice to simply use its exponential form and multiply by \(e^{i \cdot \phi}\):

\[ z^\prime = z \cdot e^{i \cdot \phi} = m \cdot e^{i \cdot (\theta + \phi)} \]

This can also be seen through rotation matrices in \(\mathbb{R}^2\):

\[ \begin{split} a^\prime &= m \cos(\theta + \phi) = m \cos(\theta) \cos(\phi) - m \sin(\theta) \sin(\phi)\\ b^\prime &= m \sin(\theta + \phi) = m \sin(\theta) \cos(\phi) + m \cos(\theta) \sin(\phi) \end{split} \]

Where we used the sine and cosine sum formulas:

\[ \begin{split} \cos(\alpha + \beta) &= \cos(\alpha) \cos(\beta) - \sin(\alpha) \sin(\beta)\\ \sin(\alpha + \beta) &= \sin(\alpha) \cos(\beta) + \cos(\alpha) \sin(\beta)\\ \end{split} \]

We can further simplify the previous formula using \(a = m \cos \theta\), and \(b = m \sin \theta\):

\[ \begin{split} a^\prime &= m \cos(\theta + \phi) = a \cos(\phi) - b \sin(\phi)\\ b^\prime &= m \sin(\theta + \phi) = b \cos(\phi) + a \sin(\phi) \end{split} \]

Which, if we express in matrix formulation:

\[ \begin{bmatrix} a^\prime\\ b^\prime\\ \end{bmatrix} = \begin{bmatrix} \cos(\phi) & -\sin(\phi)\\ \sin(\phi) & \cos(\phi)\\ \end{bmatrix} \begin{bmatrix} a\\ b\\ \end{bmatrix} \]

Which, again (see Tip 3), is a rotation matrix; this time it applies a counterclockwise rotation.
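The equivalence between the complex-exponential form and the rotation matrix is easy to verify numerically. A small sketch with Python’s standard `cmath` module (the concrete values of \(z\) and \(\phi\) are arbitrary):

```python
import cmath
import math

z = 3 + 4j      # z = a + i*b, with modulus m = 5
phi = 0.7       # rotation angle

# Rotating via complex multiplication: z' = z * e^(i*phi)
z_rot = z * cmath.exp(1j * phi)

# ...matches the 2x2 rotation matrix acting on (a, b):
a, b = z.real, z.imag
a_rot = a * math.cos(phi) - b * math.sin(phi)
b_rot = b * math.cos(phi) + a * math.sin(phi)
assert math.isclose(z_rot.real, a_rot) and math.isclose(z_rot.imag, b_rot)

# Rotations preserve the modulus: |z'| = |z| = 5.
assert math.isclose(abs(z_rot), 5.0)
```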

More formally, given an embedding \(x\) at position \(n\): \[ x = [ x_0, x_1, x_2, x_3, \dots , x_{d-2}, x_{d-1}] \]

We do the “pairing” of contiguous dimensions, this time expressing the values as complex numbers:

\[ x = (x_0 + i \cdot x_1) , (x_2 + i \cdot x_3), \dots , (x_{d-2} + i \cdot x_{d-1}) \]

We have \(\frac{d}{2}\) pairs of complex values, indexed by \(k \in \{0, \dots, \frac{d}{2} - 1\}\). \(k\) is also referred to as the “offset” (same as before).

The RoPE operation is defined by applying a rotation of \(n \theta_k\) at each corresponding complex number:

\[ x^\prime = (x_0 + i \cdot x_1) \cdot e^{i \cdot n \cdot \theta_0} , (x_2 + i \cdot x_3) \cdot e^{i \cdot n \cdot \theta_1}, \dots , (x_{d-2} + i \cdot x_{d-1}) \cdot e^{i \cdot n \cdot \theta_{\frac{d}{2}-1}} \]

Where \(\theta_k\) is the same as before:

\[ \theta_k := \frac{1}{10000^{\frac{2k}{d}}} \]

Notice we can equivalently express the RoPE operation as a linear function without using complex numbers. Given the query (or key) embedding \(x\) at position \(n\):

\[ \begin{bmatrix} x_0^\prime\\ x_1^\prime\\ x_2^\prime\\ x_3^\prime\\ \vdots\\ x_{d-2}^\prime\\ x_{d-1}^\prime\\ \end{bmatrix} = \begin{bmatrix} \cos(n \cdot \theta_0) & -\sin(n \cdot \theta_0) & 0 & 0 & \dots & 0 & 0\\ \sin(n \cdot \theta_0) & \cos(n \cdot \theta_0) & 0 & 0 & \dots & 0 & 0\\ 0 & 0 & \cos(n \cdot \theta_1) & -\sin(n \cdot \theta_1) & \dots & 0 & 0\\ 0 & 0 & \sin(n \cdot \theta_1) & \cos(n \cdot \theta_1) & \dots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \dots & \cos(n \cdot \theta_{\frac{d}{2}-1}) & -\sin(n \cdot \theta_{\frac{d}{2}-1})\\ 0 & 0 & 0 & 0 & \dots &\sin(n \cdot \theta_{\frac{d}{2}-1}) & \cos(n \cdot \theta_{\frac{d}{2}-1})\\ \end{bmatrix} \begin{bmatrix} x_0\\ x_1\\ x_2\\ x_3\\ \vdots\\ x_{d-2}\\ x_{d-1}\\ \end{bmatrix} \]
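The complex-number formulation above is only a few lines of NumPy. Here is a minimal sketch (the function name is mine), which also checks RoPE’s key property: after rotation, the dot product of a query and a key depends only on their relative distance, not their absolute positions:

```python
import numpy as np

def rope(x, n):
    """Apply RoPE to embedding x (shape (d,), d even) at sequence position n,
    via the complex-number formulation: rotate pair k by angle n * theta_k."""
    d = x.shape[0]
    k = np.arange(d // 2)
    theta = 1.0 / 10000 ** (2 * k / d)        # per-pair frequencies
    pairs = x[0::2] + 1j * x[1::2]            # (x0 + i*x1), (x2 + i*x3), ...
    rotated = pairs * np.exp(1j * n * theta)  # the rotation itself
    out = np.empty_like(x, dtype=float)
    out[0::2], out[1::2] = rotated.real, rotated.imag
    return out

# Rotations preserve dot products up to the relative angle, so
# <rope(q, n), rope(k, m)> depends only on m - n:
rng = np.random.default_rng(0)
q, k_emb = rng.standard_normal(16), rng.standard_normal(16)
score_a = rope(q, 5) @ rope(k_emb, 9)      # positions 5 and 9
score_b = rope(q, 105) @ rope(k_emb, 109)  # same relative distance of 4
assert np.isclose(score_a, score_b)
```

This relative-distance property is exactly why RoPE is applied to queries and keys rather than to the input embeddings.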

ALiBi : Attention with Linear Biases

The Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation paper presents yet another way of introducing positional information into the transformer.

In this case, they simply add a “recency” bias directly to the attention matrix. The idea is to linearly reduce the attention to older tokens, as presented here:

Figure 1: They construct a bias matrix and add it to the standard attention matrix to down-weight “old” scores. \(m\) is a parameter unique to every head.

Notice that each head \(h \in \{1, \dots, n \}\) gets its own slope \(m_h\). They chose a fixed geometric progression for that:

\[ \begin{equation} m_h \;=\; 2^{-\frac{8h}{n}}, \qquad h=1,\dots,n. \end{equation} \]

For \(n=8\) this yields \(\{2^{-1},\,2^{-2},\,\dots,\,2^{-8}\}\); for \(n=16\) it becomes \(\{2^{-1/2},\,2^{-1},\,2^{-3/2},\,\dots,\,2^{-8}\}\).

This approach is very simple and seemingly quite effective; however, the current trend of LLMs still seems to prefer RoPE over it.

Is positional encoding necessary tho?

Wait a moment. Please tell me all this wasn’t for nothing!

Not for nothing, but it’s worth noting that autoregressive decoder-only models don’t actually need positional encoding to work: during training, we already apply causal masking, which lets each token attend only to its previous tokens. During inference, each token can only attend to previous ones (as many as the current step count). So, in a way, the positional information is already there.

This is explored in the Transformer Language Models without Positional Encodings Still Learn Positional Information paper, where they compare the output perplexity when applying different positional encodings:

Figure 2: Perplexity of different positional encodings.

In addition, they train a classifier to predict the position of each token given its embeddings after each layer. They obtain similar results across the different positional encoding methods, suggesting that a model without explicit positional information can still perform strongly (at least for sequences of reasonable length).

And that’s pretty much it! Hope you enjoyed this. Like and subscribe! Oleguer out :)