Understanding positional encodings

I try to give an intuitive understanding of sinusoidal positional encodings, RoPE, and the relationship between them.
Author

Oleguer Canal

Published

July 1, 2024

“Interestingly”, the standard dot-product attention mechanism1 is unaware of elements’ positions within a sequence: outputs are a weighted average of inputs according to their inter-similarity. Unless we somehow encode positional information, we could forward the input sequence in a permuted order and the output would be the same. Order, however, is a key piece of information for sequential data2.

1 The heart 💙 of the transformer architecture

2 An unordered sentence, video or audio makes no sense!

Today, we look at the two most influential ways of encoding positional information into the transformer architecture: sinusoidal positional encodings and RoPE (rotary position embeddings).

Before we start, let’s review a couple of obvious approaches that might seem to do the trick well enough, but don’t work well in practice. This is a nice exercise to surface some desirable properties of positional encodings.

At first glance, “appending positional information” seems like a trivial task. One might ask:

🤔 How about appending each element’s sequence index to the input as an extra feature?

  • Large numbers: The sequence length can be very long, making these indices very large. This value disparity with the other inputs can make learning harder; usually one wants more-or-less normalized inputs.
  • Harder generalization: At inference time, the model might see sequences longer than those in the training set (hence larger index values), making predictions unstable.

🤔 How about, instead of appending the raw index, linearly splitting the 0–1 interval and appending each element’s corresponding value to the input?

  • The main issue with this approach is time-step inconsistency across different input lengths: an input of 10 elements has a step size of 0.1, while an input of 20 elements has 0.05 (see the quick check below).
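As a quick numeric check of this inconsistency (a minimal sketch of my own, not from any library):

```python
import numpy as np

# Linearly spreading positions over [0, 1): the step size depends on the
# sequence length, so the same encoded value points at different relative
# positions for sequences of different lengths.
for length in (10, 20):
    positions = np.arange(length) / length
    print(f"length={length:2d} -> step size {positions[1]:.2f}")  # 0.10 vs 0.05
```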

Ok, so it seems it won’t be so easy-peasy lemon-squeezy, let’s see how the pros do it!

Sinusoidal positional encodings

This was the approach taken by the original transformer architecture (2017). Given a list of T tokens: first, they project each token to its corresponding embedding, resulting in a matrix of shape T×d.3 Then, they simply add a “temporal information” matrix (also of shape T×d) on top of the input embeddings. This happens at the very beginning of the transformer architecture:

3 d being the dimensionality of the embedding.

Where the positional matrix is summed with the input embeddings (red circles)

After the addition, each token embedding contains information on which position it holds in the sequence.

Intuitively explained

Each row of the matrix can be seen as a “timestamp”, a bit similar to what a number with d digits would look like (one digit for each coordinate). However, instead of using base-10 digits4, the cycles are given by sin and cos functions. In particular, they pair contiguous coordinates of the row: the first element of each pair is the sine and the second the cosine of a function with a given frequency. Earlier coordinates have higher frequencies than later ones.

4 For reasons similar to what I explained in Tip 1.

Representation of what a given sinusoidal embedding might look like.

I give further intuition for this in Tip 2 and explain the advantages of this design choice in the Relative position awareness section. But before that, let’s first formalize everything.

Mathematical definition

Each row of this matrix is a vector defined by:

$$p_t^{(i)} := \begin{cases} \sin(\omega_k t), & \text{if } i = 2k \\ \cos(\omega_k t), & \text{if } i = 2k+1 \end{cases}$$

Where:

$$\omega_k = \frac{1}{10000^{2k/d}}$$

They call k the “offset” in the paper (referring to the embedding coordinate pairs), and I think the 10000 is quite arbitrary: as long as you use a sufficiently large number, it should work.
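As a rough sketch of how such a matrix can be built (my own numpy code, with a made-up name `build_sinusoidal_pe`, not taken from the paper):

```python
import numpy as np

def build_sinusoidal_pe(T: int, d: int, base: float = 10000.0) -> np.ndarray:
    """Sinusoidal positional encodings of shape (T, d); d is assumed even."""
    t = np.arange(T)[:, None]               # positions 0..T-1, shape (T, 1)
    k = np.arange(d // 2)[None, :]          # pair index ("offset"), shape (1, d/2)
    omega = 1.0 / base ** (2 * k / d)       # frequencies omega_k
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(t * omega)         # even coordinates (i = 2k): sine
    pe[:, 1::2] = np.cos(t * omega)         # odd coordinates (i = 2k+1): cosine
    return pe

# This matrix is simply added to the token embeddings at the very beginning:
# x = token_embeddings + build_sinusoidal_pe(T, d)
```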

What does this matrix actually look like?

This T×d matrix looks like this:

Sinusoidal positional embeddings for a sequence of 50 elements (rows) and a 16-dimensional embedding (columns). Notice the high frequency in the first two components (k=0), and how each subsequent pair of components has a lower frequency.
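For reference, a heatmap like this one can be reproduced by reusing the hypothetical `build_sinusoidal_pe` sketched above:

```python
import matplotlib.pyplot as plt

pe = build_sinusoidal_pe(T=50, d=16)     # reusing the sketch from above
plt.pcolormesh(pe.T, cmap="RdBu")        # x-axis: position t, y-axis: coordinate i
plt.xlabel("position t")
plt.ylabel("embedding coordinate i")
plt.colorbar()
plt.show()
```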

There are many ways of thinking about this; here are a couple I like.

The clocks analogy

Given a d-dimensional positional encoding, take the vector at row t:

$$\left[\sin\left(\tfrac{t}{1}\right), \cos\left(\tfrac{t}{1}\right), \sin\left(\tfrac{t}{10000^{2/d}}\right), \cos\left(\tfrac{t}{10000^{2/d}}\right), \sin\left(\tfrac{t}{10000^{4/d}}\right), \cos\left(\tfrac{t}{10000^{4/d}}\right), \ldots, \sin\left(\tfrac{t}{10000^{1}}\right), \cos\left(\tfrac{t}{10000^{1}}\right)\right]$$

As we move to the right along this vector, each component evolves more slowly: the denominator grows, which makes the frequency lower. I like to pack this array into subsequent pairs of coordinates, and imagine it as an array of d/2 clocks5:

$$\left[\underbrace{\sin\left(\tfrac{t}{1}\right), \cos\left(\tfrac{t}{1}\right)}_{🕐}, \underbrace{\sin\left(\tfrac{t}{10000^{2/d}}\right), \cos\left(\tfrac{t}{10000^{2/d}}\right)}_{🕑}, \underbrace{\sin\left(\tfrac{t}{10000^{4/d}}\right), \cos\left(\tfrac{t}{10000^{4/d}}\right)}_{🕒}, \ldots, \underbrace{\sin\left(\tfrac{t}{10000^{1}}\right), \cos\left(\tfrac{t}{10000^{1}}\right)}_{🕧}\right]$$

As we move through time (advance down the rows of the matrix): The first clocks move very fast (like seconds), and then each one moves progressively slower (minutes, hours, …) as we go to the right. Notice that now the “offset” value k is analogous to the clock “index” in this vector:

[🕐, 🕑, 🕒, …, 🕧]

This is exactly what we are doing with the definition of the positional embedding! But since clocks are not numbers6, we store the y and x coordinates of each clock’s hand - a way to uniquely identify each clock’s position. These y and x coordinates are the sin and cos of the hand’s angle 🤯

Binary encoding analogy

On a different note, we can also see a similar pattern when writing numbers in binary.

Consider 4-bit binary numbers:

$$\begin{aligned}
0000_2 &= 0_{10} & 0001_2 &= 1_{10}\\
0010_2 &= 2_{10} & 0011_2 &= 3_{10}\\
0100_2 &= 4_{10} & 0101_2 &= 5_{10}\\
0110_2 &= 6_{10} & 0111_2 &= 7_{10}\\
1000_2 &= 8_{10} & 1001_2 &= 9_{10}
\end{aligned}$$

We have that the least significant bit (rightmost) changes at every step, the next one every 2 steps, then every 4, then every 8, etc. In essence, we have the same pattern as before, where one coordinate changes very frequently, the next one less so, and so on.

This is what a binary positional-embedding matrix would look like, also for 50 elements and 16 dimensions.
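For completeness, here is a rough sketch of how such a binary matrix could be built (again my own code; `build_binary_pe` is a made-up name):

```python
import numpy as np

def build_binary_pe(T: int, d: int) -> np.ndarray:
    """Row t is the d-bit binary representation of position t (bit 0 first)."""
    t = np.arange(T)[:, None]        # positions, shape (T, 1)
    bit = np.arange(d)[None, :]      # bit index, 0 = least significant (flips fastest)
    return (t >> bit) & 1            # shape (T, d), entries in {0, 1}
```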

🤔 Why don’t we do this for our positional encodings?

In summary, binary embeddings don’t have some of the nice properties of sinusoidal embeddings (which we’ll see shortly). In addition, there is no big reason to use something as discrete as 0’s and 1’s to encode our positional information when everything else is floating-point numbers. In machine learning, we usually prefer to work with smooth, continuous, and differentiable stuff, to make things easier for the optimizer.

5 Imagine the clocks only have one hand.

6 Because they are clocks, which is a totally different thing.

Sinusoidal positional encodings present some very nice properties: uniqueness, boundedness, absolute-position awareness, relative-position awareness… I’d like to expand on the relative-position-awareness property: there exists a time-independent linear transformation mapping the embedding at time t to the one at time t+n.

Relative position awareness

Given the coordinates 2k and 2k+1 of the positional encoding at times n and m, we have that there exists an absolute-time-independent matrix $M_k$ such that:

$$M_k(m-n)\begin{bmatrix}\sin(\omega_k n)\\ \cos(\omega_k n)\end{bmatrix} = \begin{bmatrix}\sin(\omega_k m)\\ \cos(\omega_k m)\end{bmatrix}$$

In particular, the matrix $M_k$ is:

$$M_k(m-n) = \begin{bmatrix}\cos(\omega_k(m-n)) & \sin(\omega_k(m-n))\\ -\sin(\omega_k(m-n)) & \cos(\omega_k(m-n))\end{bmatrix}$$

Interpretation of this linear mapping between two different time-steps (n and m), and a given offset k (in this case k=1). Of course, this happens for any pair of sin-cos (any k), but with a different matrix Mk for each pair.
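A quick numerical sanity check of this relative-position property (my own sketch, using the $\omega_k$ definition from above):

```python
import numpy as np

def M(omega_k: float, delta: int) -> np.ndarray:
    """Matrix mapping the (sin, cos) pair at position n to position n + delta."""
    phi = omega_k * delta
    return np.array([[ np.cos(phi), np.sin(phi)],
                     [-np.sin(phi), np.cos(phi)]])

d, k = 16, 1
omega_k = 1.0 / 10000 ** (2 * k / d)

# The matrix only depends on the offset m - n, not on the absolute positions:
for n, m in [(50, 60), (1000, 1010)]:
    mapped = M(omega_k, m - n) @ np.array([np.sin(omega_k * n), np.cos(omega_k * n)])
    target = np.array([np.sin(omega_k * m), np.cos(omega_k * m)])
    print(np.allclose(mapped, target))   # True, True
```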

In linear algebra, we refer to a matrix of the form:

$$\begin{bmatrix}\cos(\phi) & -\sin(\phi)\\ \sin(\phi) & \cos(\phi)\end{bmatrix}$$

as a rotation matrix. It performs a rotation in Euclidean space: it is a transformation which rotates the xy plane counterclockwise through an angle $\phi$ around the origin.

In this case, however, we have a matrix which looks like:

$$\begin{bmatrix}\cos(\phi) & \sin(\phi)\\ -\sin(\phi) & \cos(\phi)\end{bmatrix}$$

It is still a rotation matrix, but it applies the rotation clockwise instead. It is easy to see since $\sin(-x) = -\sin(x)$ and $\cos(-x) = \cos(x)$.

As the authors put it: “We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, P(m) can be represented as a linear function of P(n).”

The hypothesis is that this linearity makes it easier for the model to reason about relative distances between positions: the same function that maps the encoding at position 50 to the one at position 60 also maps the encoding at position 1000 to the one at position 1010, since only the offset (here 10) matters.

This all looks very cool, but something always bothered me:

🤔 Why sum this matrix onto the input embeddings? Wouldn’t it be less destructive to just concatenate it?

It’d be less destructive, but we’d need double the embedding dimension, wasting a lot of memory in the process.

The counter-argument is that this addition is taken into account during training: most likely, the model learns not to store as much information in the first embedding dimensions as in the later ones, given that the first dimensions’ values oscillate much faster with position.

RoPE: Rotary Position Embedding

While sinusoidal positional encodings work well, they present some limitations when it comes to implementing more advanced/efficient versions of the transformer7. The RoFormer paper addresses them by introducing RoPE.

7 For instance: they aren’t as meaningful when compressing subsequences into a single context, when breaking up sequences across contexts, or when using kernelized variants of the attention mechanism.

RoPE is applied directly inside the attention mechanism, instead of as a positional embedding added at the very beginning.

RoPE is applied in the attention operations (red circles). No positional encoding on the inputs is applied.

Intuitively explained

RoPE’s core idea is actually very close to the sinusoids. However, instead of adding a matrix of “rotations” as before, we directly modify the query and key embeddings, right before the dot-product operation. Given a query embedding8, they do the same split-in-pairs trick as with the sinusoidal vectors, treat each pair of elements as coordinates in the 2D plane, and apply a rotation to these coordinates.

8 They do exactly the same for key embeddings; I just focus on queries here for readability.

The angle of the rotation is given by both: the element’s position in the sequence, and the coordinate offset k (same as before).

Steps followed by the RoPE function for a given query embedding at position n and coordinate pair k. The process is exactly the same for keys.

Mathematical definition

When dealing with 2D stuff, it is often helpful to express things in terms of complex numbers $\mathbb{C}$, mainly because there are a lot of simple formulas that ease computation9. In Tip 5 I explain the connection between the two.

9 Of course, we could do everything in $\mathbb{R}^2$, but it’d be a bit more tedious.

Expressing complex numbers

Remember that a complex number $z \in \mathbb{C}$ can be expressed by its real and imaginary parts:

z=a+ib

Where $i = \sqrt{-1}$ is the imaginary unit, $a = \operatorname{Re}(z)$ is the real part, and $b = \operatorname{Im}(z)$ is the imaginary part of z.

Similarly, we can express it by its modulus m and angle θ (aka phase):

$$z = m\,e^{i\theta}$$

We can find the equivalence of both expressions through basic trigonometry:

$$a = m\cos\theta \qquad b = m\sin\theta$$

Or:

$$m = \sqrt{a^2 + b^2} \qquad \theta = \arctan\frac{b}{a}$$
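As a small aside (not from the original post), Python’s built-in complex numbers and the `cmath` module already implement these conversions:

```python
import cmath

z = 3 + 4j                            # a = 3, b = 4
m, theta = abs(z), cmath.phase(z)     # modulus 5.0 and phase atan2(b, a)
z_back = cmath.rect(m, theta)         # back to a + ib (up to float error)
print(m, theta, z_back)
```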

Applying rotations

Imagine we want to “rotate” a complex value $z = a + ib = m e^{i\theta}$ by an angle of $\phi$. We basically want a new value $z'$ whose phase is $\theta + \phi$.

Here it is very nice to simply use its exponential form and multiply by $e^{i\phi}$:

$$z' = z\,e^{i\phi} = m\,e^{i(\theta + \phi)}$$

This can also be seen through rotation matrices in $\mathbb{R}^2$:

$$\begin{aligned} a' &= m\cos(\theta+\phi) = m\cos(\theta)\cos(\phi) - m\sin(\theta)\sin(\phi)\\ b' &= m\sin(\theta+\phi) = m\sin(\theta)\cos(\phi) + m\cos(\theta)\sin(\phi)\end{aligned}$$

Where we used the sine and cosine sum formulas:

$$\begin{aligned} \cos(\alpha+\beta) &= \cos(\alpha)\cos(\beta) - \sin(\alpha)\sin(\beta)\\ \sin(\alpha+\beta) &= \sin(\alpha)\cos(\beta) + \cos(\alpha)\sin(\beta)\end{aligned}$$

We can further simplify the previous formula using $a = m\cos\theta$ and $b = m\sin\theta$:

$$\begin{aligned} a' &= m\cos(\theta+\phi) = a\cos(\phi) - b\sin(\phi)\\ b' &= m\sin(\theta+\phi) = b\cos(\phi) + a\sin(\phi)\end{aligned}$$

Which, expressed in matrix form:

$$\begin{bmatrix} a'\\ b'\end{bmatrix} = \begin{bmatrix}\cos(\phi) & -\sin(\phi)\\ \sin(\phi) & \cos(\phi)\end{bmatrix}\begin{bmatrix} a\\ b\end{bmatrix}$$

Which, again (see Tip 3), is a rotation matrix. This time it applies a counterclockwise rotation.
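A minimal check (my own sketch) that multiplying by $e^{i\phi}$ and applying the rotation matrix give the same result:

```python
import numpy as np

a, b, phi = 0.8, -0.3, 0.5
z = complex(a, b)

rotated_z = z * np.exp(1j * phi)                    # z' = z e^{i phi}
R = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])
rotated_ab = R @ np.array([a, b])                   # [a', b']

print(np.allclose([rotated_z.real, rotated_z.imag], rotated_ab))  # True
```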

More formally, given an embedding x at position n: $x = [x_0, x_1, x_2, x_3, \ldots, x_{d-2}, x_{d-1}]$

We do the “pairing” of contiguous dimensions, this time expressing the values as complex numbers:

$$x = \left[(x_0 + i x_1),\ (x_2 + i x_3),\ \ldots,\ (x_{d-2} + i x_{d-1})\right]$$

We have $\frac{d}{2}$ pairs of complex values, indexed by $k \in \{0, 1, \ldots, \frac{d}{2}-1\}$. k is also referred to as the “offset” (same as before).

The RoPE operation is defined by applying a rotation of $n\theta_k$ to each corresponding complex number:

$$x' = \left[(x_0 + i x_1)e^{i n\theta_0},\ (x_2 + i x_3)e^{i n\theta_1},\ \ldots,\ (x_{d-2} + i x_{d-1})e^{i n\theta_{\frac{d}{2}-1}}\right]$$

Where θk is the same as before:

$$\theta_k := \frac{1}{10000^{2k/d}}$$

Notice we can equivalently express the RoPE operation as a linear function without using complex numbers. Given the query (or key) embedding x at position n:

$$\begin{bmatrix} x'_0\\ x'_1\\ x'_2\\ x'_3\\ \vdots\\ x'_{d-2}\\ x'_{d-1}\end{bmatrix} =
\begin{bmatrix}
\cos(n\theta_0) & -\sin(n\theta_0) & 0 & 0 & \cdots & 0 & 0\\
\sin(n\theta_0) & \cos(n\theta_0) & 0 & 0 & \cdots & 0 & 0\\
0 & 0 & \cos(n\theta_1) & -\sin(n\theta_1) & \cdots & 0 & 0\\
0 & 0 & \sin(n\theta_1) & \cos(n\theta_1) & \cdots & 0 & 0\\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & 0 & 0 & \cdots & \cos(n\theta_{\frac{d}{2}-1}) & -\sin(n\theta_{\frac{d}{2}-1})\\
0 & 0 & 0 & 0 & \cdots & \sin(n\theta_{\frac{d}{2}-1}) & \cos(n\theta_{\frac{d}{2}-1})
\end{bmatrix}
\begin{bmatrix} x_0\\ x_1\\ x_2\\ x_3\\ \vdots\\ x_{d-2}\\ x_{d-1}\end{bmatrix}$$
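Putting it all together, here is a rough numpy sketch of the RoPE transform using the complex-number view (the function name `apply_rope` is mine, not the RoFormer paper’s):

```python
import numpy as np

def apply_rope(x: np.ndarray, n: int, base: float = 10000.0) -> np.ndarray:
    """Rotate a single query/key embedding x (shape (d,), d even) at position n."""
    d = x.shape[0]
    k = np.arange(d // 2)                     # pair index ("offset")
    theta = 1.0 / base ** (2 * k / d)         # theta_k, same as before
    pairs = x[0::2] + 1j * x[1::2]            # (x_0 + i x_1), (x_2 + i x_3), ...
    rotated = pairs * np.exp(1j * n * theta)  # rotate each pair by n * theta_k
    out = np.empty_like(x)
    out[0::2] = rotated.real                  # x'_{2k}
    out[1::2] = rotated.imag                  # x'_{2k+1}
    return out

# Queries and keys are rotated independently, right before the q . k dot product:
# q_rot, key_rot = apply_rope(q, n), apply_rope(key, m)
```

This is equivalent to multiplying by the sparse block-diagonal rotation matrix above, just expressed (and computed) more compactly.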