Linear Transformers, Mamba2, and many ramblings

I go through architectures used in sequence modelling: FFN, CNN, RNN, SSM, and Transformers, along with many efficiency-optimization attempts. I provide an intuitive understanding of how they work and analyze their strengths and weaknesses, all while paying special attention (pun intended) to their computational and memory complexities.
Beyond the transformer
Author

Oleguer Canal

Published

January 15, 2025

Apparently, mosquitoes 🦟 —when flying at night— rely on a source of light for navigation. Natural sources of light tend to be very far away 🌘. Thus, mosquitoes can follow a straight line by simply keeping a constant angle with them. However, artificial lights 💡 break this system: if one keeps a steady angle with a close object, one ends up going in circles…

That being said, today I want to explore various concepts around “sequence modeling”. Let’s start by formalizing it a bit. We’ll consider two sequences:

$$X = (x_0, x_1, x_2, \dots) \qquad Y = (y_0, y_1, y_2, \dots)$$

Where $x_t, y_t \in \mathbb{R}^f \ \forall t$, and the order within $\{(x_0, y_0), (x_1, y_1), (x_2, y_2), \dots\}$ matters (each element might depend on the previous ones). This post compares different ways of computing the mapping:

$$X \to Y$$

I represent 1D tensors as rectangles. We can stack them to create 2D tensors X, Y.

I’ll try to provide an intuitive understanding of how common models work, how they relate to one another¹, and their computational/algorithmic complexity.

1 Mainly through the lenses of RNNs, because of their easy interpretability.

Focus

  • I’ll be mainly focusing on the computational bottleneck part of each model: the component performing attention, recursion, selectivity…
  • I’ll mostly ignore feed-forward, normalization and other embarrassingly-parallelizable steps.

Structure

Whenever it is interesting, I’ll split model formulation between:

  • Train-time: Where I assume the whole input sequence is available. Here we are usually interested in processing the whole sequence in a single step (i.e. not recurrently), to leverage the fact that the complete sequence is available.

  • Inference-time: I’ll focus on the case where the complete sequence is not available. For instance: when processing a stream of data (e.g. real-time audio transcription), or when running the model in an autoregressive manner (e.g. next-token prediction).

Notation

  • $T$: Length of the sequence at train time.
  • $t$: Length of the sequence at inference time, $t \in [0..T]$.
  • $f$: Feature dimension, usually both the input/output dimension and the hidden state size (if applicable, otherwise specified). It is assumed that $f \ll T$.

Inspiration

Narrative

There are so many details to comment that the big picture of the post might get a bit lost. Here is what I was going for:

Let’s tackle the problem $X \to Y$, where $X, Y$ are sequences.

  • What about $y_t = f_\omega(x_t) \ \forall t$? (Feed Forward)
    • Too simplistic: we are not using the fact that the data is ordered in a meaningful way.
  • What about $y_t = f_\omega(x_t, x_{t-1}, \dots, x_{t-k}) \ \forall t$? (1D CNN)
    • We’d need to concatenate many of those layers for context to flow from beginning to end.
  • What about $y_t, S_t = f_\omega(x_t, S_{t-1}) \ \forall t$? (RNN)
    • In general this is not parallelizable (super slow training), the network forgets information, and we have vanishing gradient problems.
  • What about making $S_t$ the concatenation of all previously seen $x_t$, i.e. $S_t = \{x_t, x_{t-1}, \dots, x_0\}$? (Transformer)
    • This works great but consumes a lot of memory and FLOPs.
  • What about a more efficient approximation of the previous one?
    • Nice ideas, but not as good as the softmax transformer.
  • Let’s go back to the RNN idea: isn’t there a way to train it more efficiently? (Mamba1)
    • Yes, if we add some structure.
  • Can’t we run it faster? (Mamba2)
    • Yes, if we add even more structure. Wait, we re-discovered a more general version of the linear transformer coming from the SSM branch!

In the following table you can see the memory and FLOP costs of the main models we’ll cover:

TLDR of this post (I still recommend reading it tho haha). I removed the $O(\cdot)$ notation for readability. I marked in bold the computationally-problematic terms (mainly those of the softmax transformers). We assume $T \gg f$. “One-step-computable” refers to whether, given $x$, we can obtain $y$ directly without iterating through the sequence (assuming enough parallel compute power is available).² ³ ⁴

| Algorithm | Train memory | Train FLOPs | Inference memory | Inference FLOPs | One-step computable | Global context |
|---|---|---|---|---|---|---|
| FF | $Tf$ | $Tf^2$ | $f^2$ | $f^2$ | 🟢 | 🔴 |
| 1D CNN | $Tf$ | $Tf^2$ | $f^2$ | $f^2$ | 🟢 | 🔴 |
| Standard RNN | $Tf$ | $Tf^2$ | $f^2$ | $f^2$ | 🔴 | 🟡 |
| Naive Softmax Transformer | **$T^2$** | **$T^2 f$** | $tf$ | $tf$ | 🟢 | 🟢 |
| Flash-Attn Softmax Transformer | $Tf$ | **$T^2 f$** | $tf$ | $tf$ | 🟢 | 🟢 |
| Mamba1 | $Tf^2$ | $Tf^2$ | $f^2$ | $f^2$ | 🟡 | 🟡 |
| Linear Transformer & SSD (Mamba2) | $Tf$ | $Tf^2$ | $f^2$ | $f^2$ | 🟢 | 🟡 |

2 But in practice much faster, thanks to fused kernel.

3 Because information gets compressed

4 Heavily optimized with strong structure and the scan operation but still not one-step.

But why is sequence modelling a challenging problem?

5 Enterprise code repositories can easily be in the order of 100k lines of code, a single second of audio has 44k datapoints (if recorded at a standard 44kHz), the human genome has 3.1 billion base pairs.

Bad combo… Still, there are several smart methods to be able to overcome these limitations. Put on your dancing shoes 🩰 because the show is about to start!

Feed-forward (FF)

Alright let’s get this over with! What’s the easiest thing we can do? 🤔.

Given $x, y \in \mathbb{R}^{T \times f}$, we could map $x_t \to y_t$ by just doing⁶:

6 I include this basic model to establish a computational lower bound and provide context for more sophisticated approaches.

$$y_t = f_\omega(x_t) \quad \forall t$$

Each $y_t$ is computed only from its corresponding $x_t$.

At its simplest, $f_\omega$ can be a linear projection⁷: $f_\omega(x) := Ax$, where $A \in \mathbb{R}^{f \times f}$. In this case, the “computational bill” at train-time becomes $O(Tf^2)$ FLOPs, since we have to perform $T$ matrix-vector multiplications of size $f \times f \to f$. In terms of memory, allocating the input and output is the main bottleneck, thus we have a cost of $O(Tf)$⁸.

7 We can also add some non-linearity and compose multiple functions to increase modelling power.

8 In language modelling this is known as a 2-gram model, in which case Y=X[1:]+<eos> (next-token prediction)

This has great computational appeal: it is extremely parallelizable at train-time, and inference can be done in constant time and memory. However, its limited modeling capabilities make it insufficient for most real-world applications: mainly, it doesn’t leverage the sequential nature of the data! There is no information flow between elements of the sequence.
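To make the shapes and costs concrete, here is a minimal PyTorch sketch of this per-timestep linear map (the variable names are mine and purely illustrative):

```python
import torch

T, f = 1024, 64                      # sequence length, feature dimension
X = torch.randn(T, f)                # input sequence
A = torch.randn(f, f)                # the (learnable) f x f projection

# y_t = A x_t for every t: one batched matmul, O(T f^2) FLOPs, O(T f) memory.
Y = X @ A.T                          # (T, f) -> (T, f), no interaction across time

# Inference on a stream is O(f^2) time and memory per step:
x_t = torch.randn(f)
y_t = A @ x_t
```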

Still, layers like this play an important role in more complex models: they non-linearly combine internal features, normalize features, and store model knowledge⁹.

9 I will not spend more time on this, as it is the part of the models I said I wouldn’t bother with in Tip 1.

Convolutional Neural Network (1D-CNN)

Ok, so how can we do better? 🤷

Instead of mapping each input of the sequence element-wise to an output, we could map a sliding window of $k$ inputs to each output:

$$y_t = f_\omega(x_t, x_{t-1}, \dots, x_{t-k})$$

Each $y_t$ is computed only from a fixed window of $x_t, \dots, x_{t-k}$.

This keeps most of the computational appeal from the previous idea while also allowing us to locally transfer information along x.


Leo is right.

Interestingly, we can sequentially compose multiple layers of this type to propagate information over longer time-spans:

$$y = f_{\omega_L} \circ \dots \circ f_{\omega_2} \circ f_{\omega_1}(x)$$

With each successive convolution, local features get cumulatively aggregated, capturing information from larger and more general aspects of the input, and finally yielding a global understanding of it. In particular, if we have an input of length $T$, we’d need at least $L = \frac{T}{k-1}$ layers for the whole input to have an influence on the whole output, i.e. for $x_0$ to be considered in $y_T$.

Representation of how long it takes for $x_0$ to have an influence on $y_T$.
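Here is a small sketch of such a stack of causal 1D convolutions (left-padding is one common way of enforcing causality; the exact architecture is illustrative):

```python
import torch
import torch.nn as nn

T, f, k, L = 1024, 64, 3, 4          # sequence length, channels, kernel size, depth

# Causal 1D convolutions: left-pad by k-1 so y_t only sees x_t ... x_{t-k+1}.
layers = []
for _ in range(L):
    layers += [nn.ConstantPad1d((k - 1, 0), 0.0), nn.Conv1d(f, f, k), nn.ReLU()]
cnn = nn.Sequential(*layers)

X = torch.randn(1, f, T)             # Conv1d expects (batch, channels, time)
Y = cnn(X)                           # (1, f, T)

# Each layer widens the receptive field by k-1: after L layers, y_t sees roughly
# L * (k - 1) + 1 past inputs, so covering the whole input needs ~T / (k - 1) layers.
```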

This is very intuitive in the 2D CNNs used for vision: the first layers extract the position of very simple features, such as edges. As the input advances through the network, these features get combined into more and more complex patterns. In the last layers, features become recognizable common shapes such as eyes, wheels, roofs… This idea was key in early/small computer vision tasks (e.g. AlexNet) and helped propel the ML field. Now, transformers (e.g. the Vision Transformer) perform this much more effectively by directly allowing all-to-all input interactions at each layer. This means all inputs have influence on all outputs at each time-step (let’s forget about causality for now), removing the need for very deep networks. I provide more intuitive understanding of this in my post about deep dream.

This approach presents other disadvantages, such as time/position invariance (fixed kernel parameters $\omega$ across the whole sequence, regardless of the input values) and vanishing gradients for very deep networks (partly solved by residual connections).

Overall, not being designed for long-context information transmission makes them a bad candidate for the studied problem. As with FFNs though, some modern sequence models include CNNs to locally combine features (see the Mamba models) or to compress the temporal dimension of the sequence (e.g. the first layers of the Whisper speech-to-text model).

Recurrent Neural Network (RNN)

Hmmm, so how can we more effectively transfer information through time? 💭

We can store some internal state containing the relevant information that needs to be transmitted across time: $S_t$¹⁰. We would apply our model sequentially, like so:

10 $S$ as in “state” at time $t$. Also known as $h_t$ for “hidden state”.

$$S_t, y_t = f_\omega(S_{t-1}, x_t)$$

Each $y_t$ is computed from $S_{t-1}$ and $x_t$.

This is a very generic formulation and there exist multiple ways of implementing it (as we’ll see later). In Tip 3 I summarize a couple of influential, now classic, RNNs.
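Before that, here is a minimal sketch of the generic recurrence above (a plain Elman-style cell, not the GRU/LSTM update rules summarized next):

```python
import torch

T, f = 1024, 64
X = torch.randn(T, f)

# Parameters of a plain (Elman-style) RNN cell: S_t = tanh(W_s S_{t-1} + W_x x_t)
W_s, W_x, W_y = (torch.randn(f, f) * 0.1 for _ in range(3))

S = torch.zeros(f)                   # fixed-size state: O(f) memory at inference
ys = []
for x_t in X:                        # inherently sequential: S_t depends on S_{t-1}
    S = torch.tanh(W_s @ S + W_x @ x_t)
    ys.append(W_y @ S)               # y_t is read out from the state
Y = torch.stack(ys)                  # (T, f)
```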

Here I present two of the most relevant functional forms for $f_\omega$, now rarely used because of the limitations I explain later on in the post.

GRU

GRUs use two gates (an update gate and a reset gate) to regulate the flow of information. The update gate balances between carrying forward previous hidden states and incorporating new information, while the reset gate decides how much of the past context to forget.

LSTM

LSTMs utilize three gates (input, forget, and output gates) along with a dedicated cell state to effectively maintain and process long-range dependencies in sequential data. The gates control how much new information is added, how much old information is discarded, and how much of the current cell state is passed to the output.

Notice that LSTMs maintain two separate states, an internal cell state and a hidden state, to better isolate and preserve long-range information. $S_t$ in this case can be seen as the concatenation of both $c_t, h_t$. GRUs, on the other hand, combine these into a single hidden state and rely on fewer gates. This design makes GRUs simpler and faster to train, but LSTMs can sometimes capture longer dependencies more effectively thanks to the separate cell state.

In their classic form (constant $S_t$ size and non-parallelizable sequence training) they presented several drawbacks which made them obsolete for big problems:

  • Non-parallelizable sequence training becomes prohibitive for long sequences. The parallelization power of GPUs is lost if the computation of $S_t$ is blocked by the computation of $S_{t-1}$. One can still parallelize along the batch dimension, but weight updates are still too slow in comparison to the other methods.

  • The fixed size of the state $S_t$ might be too small to compress all relevant information of the sequence, resulting in forgetting problems.

  • Vanishing gradient problems, which arise from back-propagation through time. In the backward pass, for each step $t$ we compute the gradient as $g_t = g_{t+1} J_t$, where $J_t$ is the Jacobian matrix of step $t$. Since $J_t$ is usually contractive¹¹, its cumulative product decays or “vanishes” with sequence length, to the point of having near-zero effect after a few iterations. This results in the network struggling to learn dependencies from earlier inputs¹² (a tiny numeric illustration follows after these footnotes).

11 I.e. it has eigenvalues $|\lambda| < 1$.

12 Solutions include: LSTMs (incorporate a gating mechanism to allow for longer-range dependencies), ReLU activations (instead of sigmoid or tanh, which have very small derivatives far from 0), gradient clipping (to prevent gradients from being too small or too large), or layer normalization (helps stabilize gradients).
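The promised illustration: repeatedly applying a contractive Jacobian shrinks the gradient geometrically (the matrix here is made up just for the example):

```python
import torch

f = 64
torch.manual_seed(0)

# A "typical" step Jacobian, rescaled to be contractive (spectral norm 0.9 < 1).
J = torch.randn(f, f)
J = 0.9 * J / torch.linalg.matrix_norm(J, ord=2)

g = torch.ones(f)                          # gradient arriving from the last time step
for t in range(1, 101):                    # g_t = g_{t+1} J_t, applied repeatedly
    g = g @ J
    if t in (1, 10, 50, 100):
        # The norm shrinks at least as fast as 0.9^t: early inputs barely get a signal.
        print(t, torch.linalg.vector_norm(g).item())
```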

We’ll later see that we can re-work either the functional form of $f_\omega$ or the definition of $S_t$ and bypass those limitations.

Transformer Decoder

Are we still doing the rhetorical question thing? Yes. Ok, so if compressing information doesn’t work, what could we do instead?

We can simply consider the complete sequence for each guess! This solves all the performance issues, at higher memory and computation costs (obviously). This is how Transformers do it:

Train-time

Given the input $X \in \mathbb{R}^{T \times f}$¹³, we linearly project it into queries, keys and values¹⁴:

$$Q = \mathrm{Linear}_{\omega_q}(X) \in \mathbb{R}^{T \times f} \qquad K = \mathrm{Linear}_{\omega_k}(X) \in \mathbb{R}^{T \times f} \qquad V = \mathrm{Linear}_{\omega_v}(X) \in \mathbb{R}^{T \times f}$$

13 I’ll focus on self-attention and, for simplicity, I assume all projections are done into a space of $f$ dimensions.

14 I provide interpretability of those in my Attention Mechanism Zoo post

And we then apply dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{L \circ (QK^T)}{\sqrt{f}}\right)V$$

Where the softmax is applied row-wise, and $L \in \mathbb{R}^{T \times T}$ is a lower-triangular matrix used for causality masking. This is the bill of naively implementing this:

| Operation | Memory | FLOPs |
|---|---|---|
| $Q, K, V$ projections | $Tf$ | $Tf^2$ |
| Computing and allocating $QK^T$ | $T^2$ | $T^2 f$ |
| Masking & softmax | 1¹⁵ | $T^2$ |
| Values | $Tf$ | $T^2 f$ |
| Total | $O(T^2)$ | $O(T^2 f)$ |
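A minimal sketch of this naive implementation (single head, no batching, random weights; just to make the $T \times T$ bottleneck explicit):

```python
import torch

T, f = 1024, 64
X = torch.randn(T, f)
Wq, Wk, Wv = (torch.randn(f, f) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                    # (T, f) each: O(T f^2) FLOPs
scores = (Q @ K.T) / f**0.5                         # (T, T): the O(T^2) memory culprit
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))   # causal masking (the L matrix)
attn = torch.softmax(scores, dim=-1)                # row-wise softmax
Y = attn @ V                                        # (T, f): O(T^2 f) FLOPs
```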

15 As in no extra space is needed

16 My research crush.

Luckily, Tri Dao¹⁶ & company introduced a way around it in 2022: Tip 4.

FlashAttention (May 2022) introduces two key ideas:

  • Less memory usage, because $QK^T$ is not materialized in memory.
  • Faster execution, because of the fused kernel.

The idea is to substitute the PyTorch operations (or those of whichever deep learning framework is being used) with a custom CUDA kernel¹⁷ (aka fused kernel) which combines and performs them more efficiently:

Common flow of operations done in PyTorch. Image from here.

Flow of operations in a memory-aware fused CUDA kernel. Image from here.

Consider: $\mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

This (obviously) results in a faster and more memory-efficient attention method. The approach not only fuses the operations but also makes better use of the GPU memory hierarchy. For instance, an A100 GPU has:

  • 40 GB (or 80 GB) of HBM (High-Bandwidth Memory): large but slow. It is implemented by stacking multiple DRAM (Dynamic Random Access Memory) dies, allowing highly parallel data transfers.

  • ~0.2 MB of SRAM (Static Random Access Memory) on each of its 108 processors: small but fast. The speed advantage of SRAM over DRAM comes from SRAM’s ability to hold a given data bit in a static state (on or off) as long as power is supplied. Moreover, it is more reliable than DRAM, which must refresh its stored data bits many times per second to maintain their integrity (making operations slower). However, SRAM has a much higher cost-per-bit and requires more physical space on the chip. Thus, it is usually reserved for uses where speed and reliability are critical (such as caches in CPUs or GPUs).

17 Fancy way of saying: function that runs on GPU written in CUDA

Inference-time

Imagine we have cached¹⁸ $K_{0:t-1}$, $V_{0:t-1}$. Then, at time $t$ with input $x_t$, we only need to compute:

18 I’ll focus on the case where we use KV-caching.

$$q_t = \mathrm{Linear}_{\omega_q}(x_t) \in \mathbb{R}^{1 \times f} \qquad k_t = \mathrm{Linear}_{\omega_k}(x_t) \in \mathbb{R}^{1 \times f} \qquad v_t = \mathrm{Linear}_{\omega_v}(x_t) \in \mathbb{R}^{1 \times f}$$

Then, using the cache:

$$K_{0:t} = \begin{bmatrix} K_{0:t-1} \\ k_t \end{bmatrix} \in \mathbb{R}^{t \times f}, \qquad V_{0:t} = \begin{bmatrix} V_{0:t-1} \\ v_t \end{bmatrix} \in \mathbb{R}^{t \times f}$$

We can then compute the new attention output vector:

$$\mathrm{Attention}_t(Q, K, V) = \mathrm{softmax}\!\left(\frac{q_t K_{0:t}^T}{\sqrt{f}}\right)V_{0:t}$$

1. We compute $q_t, k_t, v_t$ from $x_t$.
2. How is $q_t$ related to all previous keys? We compute its dot-product with each of them (using the cached keys $K_{0..t-1}$). We then apply a softmax for normalization and call the result the “attention vector”.
3. The final result is the weighted average of the values according to the attention vector obtained in step 2¹⁹ (using the cached values $V_{0..t-1}$).

19 If there was a strong relationship between a query-key pair, its associated value will have a strong influence on $y_t$.
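A sketch of these three steps with a growing KV-cache (single head, no batching; the projections are random placeholders):

```python
import torch

f = 64
Wq, Wk, Wv = (torch.randn(f, f) for _ in range(3))

K_cache = torch.zeros(0, f)                  # grows by one row per processed token
V_cache = torch.zeros(0, f)

def decode_step(x_t):
    """One autoregressive step with KV-caching: O(t f) FLOPs, O(t f) cache memory."""
    global K_cache, V_cache
    q_t, k_t, v_t = x_t @ Wq, x_t @ Wk, x_t @ Wv               # step 1: O(f^2)
    K_cache = torch.cat([K_cache, k_t[None]])                  # append k_t to the cache
    V_cache = torch.cat([V_cache, v_t[None]])                  # append v_t to the cache
    attn = torch.softmax((q_t @ K_cache.T) / f**0.5, dim=-1)   # step 2: attention vector (t,)
    return attn @ V_cache                                      # step 3: weighted average of values

for _ in range(5):
    y_t = decode_step(torch.randn(f))
```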

Computationally the cost comes down to these steps:

| Operation | Memory | FLOPs |
|---|---|---|
| $q, k, v$ projections | $f$ | $f^2$ |
| Computing and allocating $q_t K_{0:t}^T$ | $t$ | $tf$ |
| Softmax | 1 | $t$ |
| Values | $f$ | $tf$ |
| Total | $O(t)$ | $O(tf)$ |

Transformer Decoder through the RNN lenses

Notice we can see the KV-cache as the internal state $S_t$ of the model. The particular thing about this model is that the internal state grows linearly with the sequence: it takes $O(tf)$ memory.

Always having access to all past tokens is a key characteristic of transformers (both good and bad):

  • They don’t forget (in contrast to the fixed-size state of common RNNs).
  • Each layer has global context (in contrast to CNNs, whose layers have local context and which need to be very deep in order to extract global input features).
  • They have a growing memory footprint, making them inappropriate for very long sequences.

Linear Transformers

Uff, can’t we do some kind of approximation which is almost-as-good but at a much-lower computational cost? 😖

We can try! The choice of the softmax function as a non-linearity / normalization in the attention mechanism might initially seem a bit arbitrary (and maybe it was). However, as we will explore in this section, it has proven to be crucial for the performance of transformers and extremely challenging to improve upon.

There have been many attempts to make the standard transformer architecture more computationally efficient (see Tip 5²⁰).

20 This could be a post on itself, here I just go over some interesting ideas.

Lowering the computational complexity of the original transformer can yield many benefits: faster processing and the ability to process larger context windows, for instance. Despite the merit of the approaches we’ll review, the benefits obtained usually get undermined by the quality deterioration caused by the approximations made.

Still, it is worth understanding the efforts made; some of them are quite neat, and might inspire future methods:

Mapping of some relevant attempts at reducing transformer complexity. paper

We can roughly categorize these attempts into the following groups²¹:

  • Linformer leverages the empirical observation that the attention matrix is more-or-less low-rank (they look at its eigenvalue distribution). They use the Johnson–Lindenstrauss lemma to approximate the attention matrix. Fanciness aside, this lemma states that if we use a random projection matrix to project a set of points onto a lower dimension, the pairwise distances are approximately preserved. In practice they compress both keys and values along the time axis into a fixed dimension: $K \in \mathbb{R}^{T \times f} \to K' \in \mathbb{R}^{k \times f}$ and $V \in \mathbb{R}^{T \times f} \to V' \in \mathbb{R}^{k \times f}$. The attention matrix $Q K'^T \in \mathbb{R}^{T \times k}$ is then linear in time.

    • Complexity: O(T) memory and time (ignoring f).
  • Nyströmformer also leverages the low-rank assumption of the attention matrix. It smartly uses the Nyström method to approximate the attention matrix using a subset of the rows of $Q$ and $K$.

    • Complexity: O(T) memory and time (ignoring f).
  • Reformer: uses locality-sensitive hashing to reduce the number of dot-products. However, keys and queries need to be identical, which limits its modelling power and its usage for cross-attention tasks.

    • Complexity: $O(T \log T)$ memory and time (ignoring $f$).
  • Sparse Transformer: does a sparse factorization of the attention matrix.

    • Complexity: $O(T\sqrt{T})$ memory and time (ignoring $f$).
  • Big Bird: applies global attention to a few tokens, combined with local attention and random connections for the rest, to reduce the dimensionality of the attention matrix.

  • Transformers are RNNs: Use the reverse kernel trick to approximate the softmax attention. The following section expands on this idea.

    • Complexity: O(T) memory and time (ignoring f).

Here are some other relevant methods with interesting ideas; if I find the time one of these days, I’ll add a two-sentence explanation of each of them.

21 I recommend this post and this other post to start going into this 🐇 hole.

However, for the purposes of today’s blog, in this section I’ll focus on the Transformers are RNNs paper. Before we jump in, let’s first make sure we are on the same page about the kernel trick (Tip 6).

Kernel function

Given $x, y \in \mathbb{R}^n$, we say that $K: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is a kernel if there exists another function $\phi: \mathbb{R}^n \to \mathbb{R}^m$ which satisfies:

$$K(x, y) = \phi(x) \cdot \phi(y)$$

We (humans) have found a few of these kernels. For instance, the Polynomial Kernel:

$$K(x, y) = (x^T y + c)^d$$

Where, if we take for instance $d = 2$, the projection function $\phi(x)$ is:

$$\phi(x) = \left[x_n^2, \dots, x_1^2, \sqrt{2}\,x_n x_{n-1}, \dots, \sqrt{2}\,x_n x_1, \sqrt{2}\,x_{n-1} x_{n-2}, \dots, \sqrt{2}\,x_2 x_1, \sqrt{2c}\,x_n, \dots, \sqrt{2c}\,x_1, c\right]^T$$

Another typical one is the Gaussian RBF Kernel:

$$K(x, y) = \exp\left(-\frac{\lVert x - y \rVert^2}{2\sigma^2}\right)$$

Here the explicit feature mapping $\phi(x)$ is infinite-dimensional, but the kernel computes the inner product in this infinite-dimensional space, which can be very powerful.

The Kernel Trick

In Machine Learning we call “the kernel trick” the usage of a kernel function $K$ to “simulate” the projection of the data into a higher-dimensional space. It is usually easier to linearly separate datapoints after nonlinearly projecting them into a higher-dimensional space. However, explicitly projecting onto higher-dimensional spaces and then calculating the dot-product (as we would need to do if we naively computed $\phi(x) \cdot \phi(y)$) is computationally expensive.

In essence, it is a way to minimize computations.
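To make the definition concrete, here is a quick numeric check for the degree-2 polynomial kernel, using one standard choice of explicit feature map (ordered slightly differently from the vector written above, but equivalent):

```python
import torch

def phi(x, c=1.0):
    """Explicit feature map for the degree-2 polynomial kernel (x.y + c)^2."""
    outer = torch.outer(x, x).reshape(-1)            # all products x_i * x_j
    return torch.cat([outer, (2 * c) ** 0.5 * x, torch.tensor([c])])

x, y, c = torch.randn(5), torch.randn(5), 1.0
lhs = (x @ y + c) ** 2                               # kernel evaluated directly: O(n)
rhs = phi(x, c) @ phi(y, c)                          # dot product in the lifted space: O(n^2)
print(torch.allclose(lhs, rhs))                      # True (up to float error)
```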

Softmax as a kernel function

Consider a vanilla linear transformer²²:

22 Linear because all dependencies are linear: we remove the softmax.

$$\mathrm{Attention}(Q, K, V) = QK^TV$$

Where $Q, K, V \in \mathbb{R}^{T \times f}$. Traditionally (as I explained before), we’d multiply the matrices in this order:

$$\mathrm{Attention}(Q, K, V) = (QK^T)V$$

However, computing $QK^T \in \mathbb{R}^{T \times T}$ and materializing the output requires $O(T^2)$ memory 😞

But now we can do better! Since we don’t need to apply the softmax, using the associative property of matrix multiplication we can compute $K^TV$ first:

$$\mathrm{Attention}(Q, K, V) = Q(K^TV)$$

Since $K^TV \in \mathbb{R}^{f \times f}$, it only requires $O(f^2)$ memory! Once we multiply by $Q$, it ends up being $O(Tf)$ memory. For long sequences this massively reduces computational and memory costs. As hinted before, though, this vanilla implementation doesn’t work as well as the softmax version of the transformer²³.

23 More rigorous studies of this here
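A quick check of the associativity argument for the non-causal, unnormalized case:

```python
import torch

T, f = 2048, 64
Q, K, V = (torch.randn(T, f, dtype=torch.float64) for _ in range(3))

left  = (Q @ K.T) @ V    # materializes a (T, T) matrix: O(T^2) memory, O(T^2 f) FLOPs
right = Q @ (K.T @ V)    # materializes only an (f, f) matrix: O(Tf) memory, O(T f^2) FLOPs

print(torch.allclose(left, right))   # True: same result, very different cost
```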

Ok, so that doesn’t work… What can we do about it though? Wouldn’t it be quite nice if softmax were a kernel function and there existed some $\phi$ (applied row-wise) such that:

$$\mathrm{softmax}(QK^T) = K(Q, K) = \phi(Q)\phi(K)^T$$

We would then be able to write:

$$\mathrm{Attention}(Q, K, V) = \phi(Q)\left(\phi(K)^TV\right)$$

And get all the gains I explained before²⁴.

24 Notice we are applying the kernel trick in a reversed way as usually 🤯

25 Intuitively: softmax (same as the RBF kernel) has an exponential, which has an infinite Taylor expansion.

So yeah… That would be quite nice, but sadly we don’t live in happyland where unicorns bring you lollipops for lunch 🦄🍭. There is no free lunch! $\phi$ would need to be infinite-dimensional²⁵, which is quite counter-productive in this case haha.

We can choose other ϕ functions though. In the paper Transformers are RNNs, they experimentally show that row-wise applying

$$\phi(x) = \mathrm{elu}(x) + 1$$

has a performance on par with standard softmax transformers while significantly reducing computational and memory requirements.

Those are all very cool ideas! We’ll now break down the complexity at train/inference times, and try to see it through the RNN lenses 🕶️.

Train-time

As I explained in the previous section, we essentially just need to compute:

$$\mathrm{Attention}(Q, K, V) = \phi(Q)\left(\phi(K)^TV\right)$$

Computationally:

| Operation | Memory | FLOPs |
|---|---|---|
| $Q, K, V$ projections | $Tf$ | $Tf^2$ |
| Apply $\phi$ | 1 | $Tf$ |
| Compute $\phi(K)^TV$ | $f^2$ | $Tf^2$ |
| Multiply by $\phi(Q)$ | $Tf$ | $Tf^2$ |
| Total | $O(Tf)$ | $O(Tf^2)$ |

That is very nice if we always have the whole sequence available, both at train and inference time. Usually, however, we’ll want to hide the future from current and past observations: Tip 8.

How does this work? We clearly can’t just multiply $\phi(Q)\phi(K)^T$ by a triangular mask, since we don’t materialize it now:

$$\mathrm{Attention}(Q, K, V) = \left(L \circ (QK^T)\right)V$$

Would force us to compute $\phi(Q)\phi(K)^T$, defeating the purpose of this approach. It is interesting to write it down though, since it’ll come up later in the blog 😉.

For now, let’s take a step back and define:

$$V' := \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{f}}\right)V$$

More generally, instead of the softmax we can have any similarity function $\mathrm{sim}$, which, applied to a particular time step (aka row), gives:

$$V'_t = \frac{\sum_{\tau \le T} \mathrm{sim}(Q_t, K_\tau)\, V_\tau}{\sum_{\tau \le T} \mathrm{sim}(Q_t, K_\tau)}$$

Notice that in the softmax transformer we have $\mathrm{sim}(q, k) = e^{\frac{q \cdot k}{\sqrt{f}}}$. An easy way to interpret this formulation is the following:

The new value (at time $t$) is a weighted average of all the other values. The weights are given by the affinity between the query (at time $t$) and each of the keys.

It is easy to see now that if we want to avoid future observations affecting our current value, we just need to limit the sum to $\tau \le t$:

$$V'_t = \frac{\sum_{\tau \le t} \mathrm{sim}(Q_t, K_\tau)\, V_\tau}{\sum_{\tau \le t} \mathrm{sim}(Q_t, K_\tau)}$$

In our currently-studied case though, since $\mathrm{sim}(q, k) = \phi(q)\,\phi(k)^T$, we have:

$$V'_t = \frac{\phi(Q_t)\, \sum_{\tau \le t} \phi(K_\tau)^T V_\tau}{\phi(Q_t)\, \sum_{\tau \le t} \phi(K_\tau)^T}$$
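At train time this causal version can still be computed in parallel with cumulative sums instead of a $T \times T$ matrix. A naive sketch (it materializes a $T \times f \times f$ tensor, so in practice one would chunk or fuse this; $\phi = \mathrm{elu} + 1$ is the choice from the paper):

```python
import torch
import torch.nn.functional as F

T, f = 1024, 64
Q, K, V = (torch.randn(T, f) for _ in range(3))
phi = lambda x: F.elu(x) + 1                                 # feature map from the paper

phiQ, phiK = phi(Q), phi(K)
# Running sums over tau <= t, computed in parallel with cumulative sums:
S = torch.cumsum(phiK[:, :, None] * V[:, None, :], dim=0)    # (T, f, f): sum of phi(k_tau)^T v_tau
Z = torch.cumsum(phiK, dim=0)                                # (T, f):    sum of phi(k_tau)^T

num = torch.einsum("tf,tfg->tg", phiQ, S)                    # phi(Q_t) S_t
den = torch.einsum("tf,tf->t", phiQ, Z)                      # phi(Q_t) Z_t
Y = num / den[:, None]                                       # (T, f), causal by construction
```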

Inference-time

Following the derivation presented in Tip 8, we have that:

$$V'_t = \frac{\phi(Q_t)\, \sum_{\tau \le t} \phi(K_\tau)^T V_\tau}{\phi(Q_t)\, \sum_{\tau \le t} \phi(K_\tau)^T}$$

Here it is useful to define these matrices:

$$S_t := \sum_{\tau \le t} \phi(K_\tau)^T V_\tau$$

$$Z_t := \sum_{\tau \le t} \phi(K_\tau)^T$$

Notice that $S_t \in \mathbb{R}^{f \times f}$ and $Z_t \in \mathbb{R}^{f \times 1}$. We then have:

$$V'_t = \frac{\phi(Q_t)\, S_t}{\phi(Q_t)\, Z_t}$$

What is cool about this is that both $S_t$ and $Z_t$ can be computed in constant time from $S_{t-1}$ and $Z_{t-1}$ respectively. We just need to add the projections of the last input:

$$S_t = S_{t-1} + \phi(K_t)^T V_t$$

$$Z_t = Z_{t-1} + \phi(K_t)^T$$

Computationally, at each time-step we have $O(f^2)$ FLOPs and memory 🎉. More visually:

1. We compute $q_t, k_t, v_t$ from $x_t$.
2. We compute the internal states $S_t, Z_t$. Check Tip 9 for interpretability.
3. $y_t$ is the query times $S_t$, normalized by the query times $Z_t$.
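A sketch of this recurrent inference step, assuming $q_t, k_t, v_t$ have already been projected from $x_t$:

```python
import torch
import torch.nn.functional as F

f = 64
phi = lambda x: F.elu(x) + 1

S = torch.zeros(f, f)                    # running sum of phi(k_tau)^T v_tau
Z = torch.zeros(f)                       # running sum of phi(k_tau)^T

def step(q_t, k_t, v_t):
    """One inference step: O(f^2) FLOPs and memory, independent of t."""
    global S, Z
    S = S + torch.outer(phi(k_t), v_t)   # S_t = S_{t-1} + phi(k_t)^T v_t
    Z = Z + phi(k_t)                     # Z_t = Z_{t-1} + phi(k_t)^T
    return (phi(q_t) @ S) / (phi(q_t) @ Z)

for _ in range(5):
    y_t = step(torch.randn(f), torch.randn(f), torch.randn(f))
```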

I think of it the following way: each row of $k_t^T v_t \in \mathbb{R}^{f \times f}$ is the value tensor weighted by the corresponding component of the key tensor:

How I imagine $k_t^T v_t$.

In other words: each component of the key vector decides how influential the value vector is to its corresponding row of the $S$ matrix. $k$ decides how to store $v$ in $S$.

How $S$ rows get updated with a new input.

Each component of the query encodes how relevant its corresponding row of $S$ is for the given input. This means $qS$ returns a weighted average of the stored values (which are already a weighted accumulation of previous-step values, as we’ve seen). This gets normalized by how relevant each component is for the given input (query), times how relevant each row has been so far ($Z$). So, if we are querying something very strange (something that has not been accumulated much), we’ll have both a low $qS$ vector and a low $qZ$ number, so the normalization makes sense.


Oversimplified: $k$ decides where to store each $v$²⁶, and $q$ decides how to retrieve it.

Interestingly, we can also see $k$ as a kind of selectivity mechanism: if the model decides a particular input $x_t$ is not very relevant, $k$ can be close to 0, so that $x_t$ doesn’t affect the hidden state $S$.

Example

I thought of an example that can help understand the role of each component of $k$: imagine, in a language-modelling task, that the first component of $k$ activates (presents a high value) whenever there is a proper noun. The first row of $S$ will then be storing proper-noun information from the input text. Whenever the model needs to retrieve a proper noun, the query will have a high first component. This will yield a result vector whose components are mostly proper-noun information from the seen text.

26 Notice we can quite analogously think of it column-wise instead of row-wise.

Linear Transformer Decoder through the RNN lenses

We can then see $S_t$, $Z_t$ as the internal state of an RNN.

In $S$ we combine the keys and values into a single matrix. Through time we keep adding stuff (which gets normalized by $Z$).

State Space Models (SSMs)

For a much more in-depth analysis of SSMs, check out my post about Mamba models. I’ll focus on Mamba1 and Mamba2, SSMs which are:

  • Structured: the matrix $A$ is forced to take a particular form (diagonal for Mamba1, scalar-times-identity for Mamba2).

  • Selective: model parameters depend on the input at each time-step.

Mamba1 (S6)

Train-time

I’ll just focus on inference-time interpretability, since train-time gets a bit deep and I already explored it in the Mamba post. The TLDR is that they develop a custom CUDA kernel, implementing what they call the scan operation, which efficiently computes the outputs provided a complete input sequence.

Inference-time

Given $x_t \in \mathbb{R}^1$ ²⁷, Mamba1’s SSM layer does the following operation:

27 Note: usually $x_t \in \mathbb{R}^d$ and a different SSM “head” is used for each dimension. Everything I explain here is just broadcast along the $d$ dimensions of the input (as if it were batched).

$$B_t = \mathrm{Linear}_B(x_t) \in \mathbb{R}^f \qquad C_t = \mathrm{Linear}_C(x_t) \in \mathbb{R}^f \qquad \Delta_t = \mathrm{Linear}_\Delta(x_t) \in \mathbb{R}^f$$

Remember they use a pre-fixed $A$ matrix, the same one as introduced in the Diagonal State Spaces (DSS) paper. Applying the discretization step, we obtain:

$$A_t \in \mathbb{R}^{f \times f} \qquad B_t \in \mathbb{R}^{f \times 1} \qquad C_t \in \mathbb{R}^{1 \times f}$$

For the recurrence to be efficiently computable, $A_t$ is restricted to be diagonal. This is called a structured matrix, hence structured SSM. Therefore, we can store and manipulate only its diagonal elements, as if it were a vector: $A_t \in \mathbb{R}^f$.

We then apply the recurrence relation:

$$h_t = A_t h_{t-1} + B_t x_t \qquad y_t = C_t^T h_t$$
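A per-channel sketch of this selective recurrence (the projections and the discretization are simplified and illustrative; see the Mamba post for the exact formulas):

```python
import torch
import torch.nn.functional as F

f = 16                                             # hidden state size per input channel
W_B, W_C, w_dt = torch.randn(f), torch.randn(f), torch.randn(1)
A = -torch.arange(1, f + 1, dtype=torch.float32)   # fixed diagonal A, stored as a vector

h = torch.zeros(f)

def ssm_step(x_t):
    """One Mamba1-style step for a single scalar channel: O(f) FLOPs and memory."""
    global h
    B_t, C_t = W_B * x_t, W_C * x_t                # input-dependent (selective) projections
    dt = F.softplus(w_dt * x_t)                    # input-dependent step size
    A_bar = torch.exp(dt * A)                      # discretized diagonal A_t (elementwise)
    B_bar = dt * B_t                               # simplified (Euler-style) discretization
    h = A_bar * h + B_bar * x_t                    # h_t = A_t h_{t-1} + B_t x_t
    return C_t @ h                                 # y_t = C_t^T h_t

for x_t in torch.randn(100):
    y_t = ssm_step(x_t)
```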

Computationally:

*Remember, though, that this is just for one component; we need to do this for each component of $x_t \in \mathbb{R}^d$. Thus everything gets multiplied by $d$ (where usually $d \gg f$).

| Operation | Memory | FLOPs |
|---|---|---|
| $B, C, \Delta$ projections | $f$ | $f$ |
| Discretize | $f$ | $f$ |
| Compute $B_t x_t$ | $f$ | $f$ |
| Compute $A_t h_{t-1}$ | $f$ | $f$ |
| Compute $C_t^T h_t$ | $f$ | $f$ |
| Total | $O(f)$* | $O(f)$* |

Mamba1 inference visualized.
Mamba1 through the RNN lenses

It is very straightforward in this case 😂. Thinking in terms of gating mechanisms:

  • $h_t$ is the internal RNN state.
  • $A_t$ controls which components of $h_{t-1}$ get forgotten.
  • $B_t$ controls how $x_t$ gets added to $h_t$.
  • $C_t$ controls which components compose the output.

Mamba2 (SSD)

Mamba2’s SSM layer, presented within the SSD (State Space Duality) framework, introduces two key changes w.r.t. Mamba1:

  1. It further restricts the $A$ matrix to be of the scalar-times-identity type²⁸:

28 Instead of diagonal.

$$A_t = a_t I$$

  2. It directly works with multi-dimensional input-output pairs²⁹:

29 Instead of scalars.

$$x, y \in \mathbb{R}^{T \times d}$$

Train-time

Since at train time we have the full sequence $x \in \mathbb{R}^T$ available, applying the same operations as before (linear projection + discretization) we can pre-compute:

$$a \in \mathbb{R}^T \qquad B \in \mathbb{R}^{T \times f} \qquad C \in \mathbb{R}^{T \times f}$$

Now the problem gets simplified to the point that we can express the input-output mapping directly as a single matrix multiplication! To do so, let’s define a new matrix $L$:

$$L = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
a_1 & 1 & 0 & \cdots & 0 \\
a_2 a_1 & a_2 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\prod_{t=1}^{T-1} a_t & \prod_{t=2}^{T-1} a_t & \prod_{t=3}^{T-1} a_t & \cdots & 1
\end{bmatrix}$$

We then have that:

$$y = \left(L \circ (CB^T)\right) x$$

It is actually very simple to see this, given the recurrence defined by the discretized SSM problem and the constraints:

$$h_t = a_t h_{t-1} + B_t x_t \qquad y_t = C_t^T h_t$$

Then:

$$h_{-1} = 0$$
$$y_0 = c_0^T (a_0 \cdot 0 + b_0 x_0) = c_0^T b_0\, x_0$$
$$y_1 = c_1^T (a_1 h_0 + b_1 x_1) = c_1^T (a_1 b_0 x_0 + b_1 x_1)$$
$$y_2 = \dots$$
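A small numeric check that the recurrent form and the matrix form compute the same map (single scalar channel, random parameters):

```python
import torch

T, f = 6, 4
torch.manual_seed(0)
a = torch.rand(T)                          # scalar a_t per step (A_t = a_t * I)
B = torch.randn(T, f)                      # B_t as rows
C = torch.randn(T, f)                      # C_t as rows
x = torch.randn(T)                         # one scalar input channel

# Recurrent form: h_t = a_t h_{t-1} + B_t x_t,  y_t = C_t . h_t
h = torch.zeros(f)
y_rec = []
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec.append(C[t] @ h)
y_rec = torch.stack(y_rec)

# Matrix form: y = (L o C B^T) x, with L[t, tau] = a_t * ... * a_{tau+1} for tau <= t
L = torch.zeros(T, T)
for t in range(T):
    for tau in range(t + 1):
        L[t, tau] = torch.prod(a[tau + 1 : t + 1])
y_mat = (L * (C @ B.T)) @ x

print(torch.allclose(y_rec, y_mat, atol=1e-5))   # True: same map, two computation orders
```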

On its relationship with linear transformers

Hold your 🐎🐎! Didn’t we see something very similar already??

Yes! The linear transformer has almost the same form!! 🤯

$$y = \left(L \circ (QK^T)\right) V$$

The linear transformer is a particular case of the SSD framework, where $a_t = 1 \ \forall t$ instead of depending on the input $x$.

This can also be understood as a generalization of positional encoding, as it is now input-dependent instead of fixed (sinusoidal, rotary (RoPE), or whatever). Intuitively: the further apart two points $x_{t_1}, x_{t_2}$ are, the smaller the product $\prod_t a_t$ between them will be, and the weaker their dependence, depending of course on what lies between them.

On memory requirements of Mamba2

Keep your shirt on 👕! Doesn’t this formulation force us to use the quadratic formulation of linear transformers?

Good question, and yes, if we implement it naively! However, the matrix $M := L \circ (CB^T)$ is highly structured. In particular, it can be efficiently split into sub-blocks, parallelizing matrix multiplications while avoiding recomputation.

On its modelling power

Cool your jets 🛩️! Isn’t it counter-productive to restrict $A$ to be scalar-times-identity?

This is something still being tested³⁰. By forcing $A$ to be scalar-times-identity we lose the ability to select which components of the hidden state get erased: given an input $x_t$, we proportionally keep all $h_{t-1}$ components (we either erase, fade or enhance them all equally). However, it looks like you can gain more by simply allowing a higher $x$ dimensionality, which is analogous to having more attention heads. Plus, by leveraging better (more hardware-optimized) matrix multiplications, we get much faster training than Mamba1.

30 Time of writing this is December 2024.

In theory:

  • Mamba2 training >> Mamba1 training
  • Mamba1 inference performance > Mamba2 inference performance

In practice:

  • Early results seem to indicate that the trade-offs taken in Mamba2 are worth it: Mamba2 performs on par or better on early benchmarks³¹.

31 More testing needed.

Inference-time

I’ll not dive into the inference logic of SSD, since it is analogous to the already-covered linear transformer.

Epilogue

Alright, that got a bit out of hand (unsurprisingly), but it was useful for me to connect some ideas I had about sequence modelling. See you around!

Pasting it here again to wrap things up:

Let’s tackle the problem $X \to Y$, where $X, Y$ are sequences.

  • What about $y_t = f_\omega(x_t) \ \forall t$? (Feed Forward)
    • Too simplistic: we are not using the fact that the data is ordered in a meaningful way.
  • What about $y_t = f_\omega(x_t, x_{t-1}, \dots, x_{t-k}) \ \forall t$? (1D CNN)
    • We’d need to concatenate many of those layers for context to flow from beginning to end.
  • What about $y_t, S_t = f_\omega(x_t, S_{t-1}) \ \forall t$? (RNN)
    • In general this is not parallelizable (super slow training), the network forgets information, and we have vanishing gradient problems.
  • What about making $S_t$ the concatenation of all previously seen $x_t$, i.e. $S_t = \{x_t, x_{t-1}, \dots, x_0\}$? (Transformer)
    • This works great but consumes a lot of memory and FLOPs.
  • What about a more efficient approximation of the previous one?
    • Nice ideas, but not as good as the softmax transformer.
  • Let’s go back to the RNN idea: isn’t there a way to train it more efficiently? (Mamba1)
    • Yes, if we add some structure.
  • Can’t we run it faster? (Mamba2)
    • Yes, if we add even more structure. Wait, we re-discovered a more general version of the linear transformer coming from the SSM branch!

The end.