“Interestingly”, the standard dot-product attention mechanism1 is unaware of elements’ positions within a sequence: outputs are a weighted average of inputs according to their inter-similarity. Unless we encode positional information somehow, we could forward the input sequence in a permuted order and the output would be the same. Order, however, is a key piece of information for sequential data2.
1 The heart 💙 of the transformer architecture
2 An unordered sentence, video or audio makes no sense!
Today, we look at the two most influential ways of encoding positional information into the transformer architecture:
Sinusoidal embeddings [2017]: The original approach presented in Attention is all you need.
Rotary Position Embedding (RoPE) [2021]: Presented in the RoFormer paper and now widely adopted by most modern architectures: DeepSeek-v3, Qwen, Llama-3…
Before we start, let’s review a couple of obvious approaches that might seem to do the trick well enough, but don’t work well in practice. This is a nice exercise to surface some desirable properties of positional encodings.
At a first glance, “appending positional information” seems like a trivial task. One might ask:
🤔 How about appending each element’s sequence index to the input as an extra feature?
- Large numbers: The sequence length can be very long, making these indices very large. This value disparity with the other inputs can make learning harder; usually one wants more-or-less normalized inputs.
- Harder generalization: At inference time, the model might see sequences longer than those in the training set (hence larger index values), making predictions unstable.
🤔 How about, instead of appending the index, linearly splitting the 0-1 interval and appending each corresponding value to each input?
- The main issue with this approach is time-step inconsistency across different input lengths: An input of 10 elements has a step size of $\frac{1}{10} = 0.1$ between consecutive positions, while an input of 20 elements has $\frac{1}{20} = 0.05$. The same positional value thus means a different thing depending on the sequence length.
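A quick numerical sketch of this inconsistency (the `normalized_positions` helper is just my illustration of the scheme above):

```python
import numpy as np

# Naive scheme: position i in a sequence of length n gets the extra feature i / n.
def normalized_positions(seq_len: int) -> np.ndarray:
    return np.arange(seq_len) / seq_len

print(normalized_positions(10)[:3])  # [0.   0.1  0.2 ] -> step of 0.1
print(normalized_positions(20)[:3])  # [0.   0.05 0.1 ] -> step of 0.05
# The same numeric value corresponds to a different "distance in tokens"
# depending on the sequence length.
```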
Ok, so it seems it won’t be so easy-peasy lemon-squeezy. Let’s see how the pros do it!
Sinusoidal positional encodings
This was the approach taken by the original transformer architecture (2017). Given a list of $n$ input token embeddings of dimension $d_{model}$, we build a positional encoding matrix $PE \in \mathbb{R}^{n \times d_{model}}$ and add it to the token embeddings3.
After the addition, each token embedding contains information on which position it holds in the sequence.
Intuitively explained
Each row of the matrix can be seen as a “timestamp”. A bit similar to what a set of clock hands spinning at different speeds would give us: instead of storing the raw position index directly4, we encode it with several bounded, oscillating values.
4 For reasons similar to what I explained in Tip 1.
I give further intuition about this in Tip 2 and explain the advantages of this design choice in the Relative position awareness section. But before that, let’s first formalize everything.
Mathematical definition
Each row of this matrix is a vector defined by:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Where:

- $pos$ is the position of the token in the sequence (the row index).
- $i$ indexes the pair of dimensions $(2i, 2i+1)$, with $0 \le i < d_{model}/2$.
- $d_{model}$ is the embedding dimension.

They call $\omega_i = \frac{1}{10000^{2i/d_{model}}}$ the frequency of each pair of dimensions: the first pairs oscillate very fast with $pos$, and the later ones progressively slower.
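Here is a minimal NumPy sketch of this definition (naming like `d_model` is mine, and I assume an even embedding dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Build the (seq_len, d_model) matrix of sinusoidal positional encodings."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))  # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get the sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get the cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=128)
print(pe.shape)  # (50, 128)
```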
What does this matrix actually look like?
This is what the matrix looks like as a heatmap: the first (leftmost) dimensions oscillate very quickly as we move down the rows (positions), while the later ones vary more and more slowly.
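If you want to generate a plot like this yourself, here is a quick matplotlib sketch (the sizes `seq_len=64`, `d_model=128` are arbitrary choices of mine):

```python
import numpy as np
import matplotlib.pyplot as plt

seq_len, d_model = 64, 128
pos = np.arange(seq_len)[:, None]
i = np.arange(d_model // 2)[None, :]
angles = pos / (10000 ** (2 * i / d_model))
pe = np.zeros((seq_len, d_model))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

plt.imshow(pe, cmap="RdBu", aspect="auto")  # one row per position
plt.xlabel("embedding dimension")
plt.ylabel("position in the sequence")
plt.colorbar()
plt.show()
```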
There are many ways of thinking about this; here are a couple I like.
The clocks analogy
Given a $d_{model}$-dimensional embedding, imagine we have $d_{model}/2$ clocks5, one per pair of dimensions. At position $pos$, the hand of clock $i$ points at an angle of $\omega_i \cdot pos$.

As we move through time (advance down the rows of the matrix): the first clocks move very fast (like seconds), and then each one moves progressively slower (minutes, hours, …) as we go to the right. Notice that now the “offset” value stays bounded: no matter how large $pos$ gets, each hand simply wraps around its clock face.

This is exactly the same as we are doing with the definition of the positional embedding! But since clocks are not numbers6, we store the position of each hand through its sine and cosine, i.e. the 2D coordinates of the tip of the hand.
Binary encoding analogy
On a different note, we can also see a similar pattern when writing numbers in binary.
Consider 4-bit binary numbers: 0 → 0000, 1 → 0001, 2 → 0010, 3 → 0011, 4 → 0100, 5 → 0101, 6 → 0110, 7 → 0111, 8 → 1000, …
We have that the least significant bit (rightmost) changes every time, the next one every 2 times, then every 4, then every 8, etc. In essence, we have the same pattern as before, where one coordinate changes very frequently, and the next one less so, and so on.
🤔 Why don’t we do this for our positional encodings?
In summary, binary embeddings don’t have some of the nice properties of sinusoidal embeddings (which we’ll see shortly). In addition, there is no big reason to use something as discrete as 0’s and 1’s to encode our positional information when everything else is floating point. In machine learning, we usually prefer to work with smooth, continuous, and differentiable stuff, to make things easier for the optimizer.
5 Imagine the clocks only have one hand.
6 Because they are clocks, which is a totally different thing.
Sinusoidal positional encodings present some very nice properties: uniqueness, boundedness, absolute-position-awareness, relative-position-awareness… I’d like to expand a bit on the relative-position-awareness property: there exists a time-independent linear transformation mapping the embedding at time $t$ to the embedding at time $t + \Delta$, for any fixed offset $\Delta$.
Relative position awareness
Given the coordinates of the $i$-th pair at position $t$, $\big(\sin(\omega_i t), \cos(\omega_i t)\big)$, we can obtain the coordinates at position $t + \Delta$ with a matrix multiplication:

$$\begin{pmatrix} \sin(\omega_i (t+\Delta)) \\ \cos(\omega_i (t+\Delta)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i \Delta) & \sin(\omega_i \Delta) \\ -\sin(\omega_i \Delta) & \cos(\omega_i \Delta) \end{pmatrix} \begin{pmatrix} \sin(\omega_i t) \\ \cos(\omega_i t) \end{pmatrix}$$

In particular, the matrix only depends on the offset $\Delta$, not on the position $t$.

In linear algebra, we refer to a matrix of type:

$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

as a rotation matrix. It performs a rotation operation in the euclidean space: it is a transformation which rotates the plane counterclockwise by an angle $\theta$ around the origin.

In this case, however, we have a matrix which looks like:

$$\begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}$$

It is still a rotation matrix, but it applies the rotation clockwise instead. It is easy to see since:

$$\begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} = \begin{pmatrix} \cos(-\theta) & -\sin(-\theta) \\ \sin(-\theta) & \cos(-\theta) \end{pmatrix} = R(-\theta)$$
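A small numerical check of this property, with `omega` standing in for one of the frequencies $\omega_i$ and arbitrary values for $t$ and $\Delta$:

```python
import numpy as np

omega = 0.3          # frequency of one (sin, cos) pair; any fixed value works
t, delta = 7.0, 5.0  # arbitrary position and offset

def pe_pair(pos: float) -> np.ndarray:
    # One (sin, cos) pair of the sinusoidal encoding at a given position.
    return np.array([np.sin(omega * pos), np.cos(omega * pos)])

# Matrix that only depends on the offset delta, not on the position t.
M = np.array([[ np.cos(omega * delta), np.sin(omega * delta)],
              [-np.sin(omega * delta), np.cos(omega * delta)]])

print(np.allclose(M @ pe_pair(t), pe_pair(t + delta)))  # True
```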
As the authors put it: “We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.”
The hypothesis is that this linearity makes it easier for the model to understand relative distances between positions: the same function maps position $t$ to position $t + \Delta$ regardless of $t$, so the difference between positions 5 and 9 is expressed exactly like the difference between positions 105 and 109.
This all looks very cool, but something that always bothered me was:
🤔 Why sum this matrix to the input embeddings? Wouldn’t it be less destructive to just concatenate it?
It’d be less destructive, but we’d need double the embedding dimension, wasting a lot of memory in the process.
The counter-argument is that this addition is taken into account during training: given that the first dimensions’ values oscillate much faster with position, the model will most likely learn to store less content information in them than in the later ones.
RoPE: Rotary Position Embedding
While sinusoidal positional encodings work well, they present some limitations when it comes to implementing more advanced/efficient versions of the transformer7. The paper RoFormer addresses them by introducing RoPE.
7 For instance: they aren’t as meaningful when compressing subsequences in a single context, or when breaking up sequences across contexts, or when using kernelized variants of the attn mechanism.
RoPE happens directly inside the attention mechanism, instead of being added to the embeddings at the beginning.
Intuitively explained
RoPE’s core idea is actually very close to the sinusoids. However, instead of adding a matrix of “rotations” as before, we directly modify the query $q$ and key $k$ embeddings. We do this modification right before the dot-product operation. Given a query embedding8 $q$, they also do the split-in-pairs thingy of the sinusoidal vectors: treat each pair of elements as coordinates in the 2D plane, and apply a rotation to these coordinates.
8 They do exactly the same for the key embeddings $k$; I just focus on queries $q$ for readability.
The angle of the rotation is given by both: the element’s position in the sequence, and the index of the coordinate pair. This way, the dot product between queries $q_m$ at position $m$ and keys $k_n$ at position $n$ ends up depending only on the relative distance $m - n$.

Mathematical definition
When dealing with 2D stuff, it is often helpful to express things in terms of complex numbers $\mathbb{C}$9.
9 Of course, we could do everything in $\mathbb{R}^2$ with rotation matrices, but the complex-number notation is more compact.
Expressing complex numbers
Remember that a complex number $z \in \mathbb{C}$ can be written in Cartesian form:

$$z = a + b\,i$$

Where $a$ is its real part, $b$ its imaginary part, and $i = \sqrt{-1}$ the imaginary unit.

Similarly, we can express it by its modulus $r$ and its angle $\varphi$, in exponential form:

$$z = r\,e^{i\varphi}$$

We can find the equivalence of both expressions through basic trigonometry:

$$a = r\cos\varphi, \qquad b = r\sin\varphi$$

Or:

$$r = \sqrt{a^2 + b^2}, \qquad \varphi = \arctan\!\left(\frac{b}{a}\right)$$
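In Python, the standard-library `cmath` module exposes both views; a tiny sketch:

```python
import cmath

z = 3 + 4j                       # Cartesian form: a + b*i
r, phi = abs(z), cmath.phase(z)  # modulus and angle (argument)

print(r, phi)              # 5.0 0.927...
print(cmath.rect(r, phi))  # ~(3+4j): rebuilt from r * e^{i*phi}
```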
Applying rotations
Imagine we want to “rotate” a complex value $z = r\,e^{i\varphi}$ by an angle $\theta$ around the origin.

Here it is very nice to simply use its exponential form and multiply by $e^{i\theta}$:

$$z\,e^{i\theta} = r\,e^{i\varphi}\,e^{i\theta} = r\,e^{i(\varphi + \theta)}$$

This can also be seen through rotation matrices in $\mathbb{R}^2$. Expanding the rotated value:

$$r\,e^{i(\varphi+\theta)} = r\cos(\varphi+\theta) + i\,r\sin(\varphi+\theta) = r(\cos\varphi\cos\theta - \sin\varphi\sin\theta) + i\,r(\sin\varphi\cos\theta + \cos\varphi\sin\theta)$$

Where we used the sine and cosine sum formulas:

$$\sin(\alpha+\beta) = \sin\alpha\cos\beta + \cos\alpha\sin\beta, \qquad \cos(\alpha+\beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta$$

We can further simplify the previous formula using $a = r\cos\varphi$ and $b = r\sin\varphi$:

$$z\,e^{i\theta} = (a\cos\theta - b\sin\theta) + i\,(a\sin\theta + b\cos\theta)$$

Which, if we express in matrix formulation:

$$\begin{pmatrix} a' \\ b' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix}$$

Which, again (see Tip 3), is a rotation matrix. This time it applies a counterclockwise rotation.
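A quick check that the complex-multiplication view and the rotation-matrix view agree (arbitrary values for $a$, $b$ and $\theta$):

```python
import numpy as np

a, b, theta = 2.0, -1.0, 0.7

# Complex view: multiply by e^{i*theta}.
rotated = (a + b * 1j) * np.exp(1j * theta)

# Matrix view: counterclockwise rotation matrix applied to (a, b).
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated_xy = R @ np.array([a, b])

print(np.allclose([rotated.real, rotated.imag], rotated_xy))  # True
```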
More formally, given an embedding $q \in \mathbb{R}^d$ (a query or a key) at position $m$:

We do the “pairing” of contiguous dimensions, this time expressing the values as complex numbers:

$$\bar{q} = \big(q_1 + i\,q_2,\; q_3 + i\,q_4,\; \dots,\; q_{d-1} + i\,q_d\big) \in \mathbb{C}^{d/2}$$

We have $d/2$ complex coordinates, one per pair of contiguous dimensions.

The RoPE operation is defined by applying a rotation of angle $m\,\theta_j$ to the $j$-th complex coordinate:

$$\operatorname{RoPE}(\bar{q}, m)_j = \bar{q}_j \, e^{i\,m\,\theta_j}$$

Where $\theta_j = 10000^{-2(j-1)/d}$ are the same kind of frequencies used by the sinusoidal encodings, and $m$ is the token’s position in the sequence.

Notice we can equivalently express the RoPE operation as a linear function without using complex numbers. Given the query (or key) embedding $q$ at position $m$, we just multiply each pair of coordinates $(q_{2j-1}, q_{2j})$ by the $2 \times 2$ rotation matrix $R(m\,\theta_j)$.
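Putting it all together, here is a minimal NumPy sketch of the rotation-by-position operation and of the relative-position property it buys us (function and variable names are mine; production implementations operate on batched tensors and precompute the angles):

```python
import numpy as np

def rope(x: np.ndarray, pos: int) -> np.ndarray:
    """Apply RoPE to a single query/key vector x of (even) dimension d at position pos."""
    d = x.shape[0]
    theta = 10000 ** (-np.arange(d // 2) * 2 / d)  # one frequency per pair of dims
    z = x[0::2] + 1j * x[1::2]                     # pair up contiguous dims as complex numbers
    z_rot = z * np.exp(1j * pos * theta)           # rotate pair j by angle pos * theta_j
    out = np.empty_like(x)
    out[0::2], out[1::2] = z_rot.real, z_rot.imag
    return out

# The key property: the dot product between a rotated query and a rotated key
# depends only on their relative distance, not on the absolute positions.
rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
s1 = rope(q, pos=5) @ rope(k, pos=2)      # distance 3
s2 = rope(q, pos=105) @ rope(k, pos=102)  # distance 3 again
print(np.allclose(s1, s2))  # True
```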