Building a low-budget diffusion-based video generation model

Learning about text-conditional diffusion through a toy example.
Miscellaneous
Author

Oleguer Canal

Published

January 12, 2026

Today, after years of procrastination, I finally completed my childhood dream: To build a diffusion-based video generation model that displays text in Comic Sans 🙌

Figure 1: Our model de-noising the word “NOISE” 🤯. Each column contains the generated video at a different de-noising step: from a pure Gaussian sample (step 0) to final output (step 1000).

This blog documents the process, experiments, and learnings. Summarized, we’ll:

  1. Motivate our training dataset: Section 1.
  2. Explain how DDPM1 works: Section 2.1.
  3. Go through our transformer-based architecture: Section 2.4
  4. Have fun with the trained model: Section 3
  5. Review more serious video-generation approaches: Section 4

1 We are working within the Denoising Diffusion Probabilistic Models paradigm.

As much of a joke as this might seem, I found this toy example to be a great way to learn about diffusion.

It also serves as a reminder of the power of generative modelling. From nothing more than observing a flattened list of noisy pixels, the model decoded both:

  1. The space-time relationship between them.
  2. Their relationship to the provided prompt.

This internal structure emerged without us explicitly forcing it; the model had just one task: guess the noise added to a list of numbers. Optimization pushed it to find the underlying distribution and dynamics of its inputs. A bit like how evolution (our optimizer) made us internally build these same structures to operate effectively: make sense of our sensory inputs and act accordingly.

The repo containing all the garbage code I wrote is here. And without further chit-chattin’, let’s get startin’.

The data

Alright, we wanna generate videos but let’s be realistic here… In today’s economy (I’m unemployed) we gotta narrow it down2. What desirable characteristics should the videos have?

2 After all, the idea is to learn how video generation works, not to build anything competitive.

  • Low-Dim: To reduce computation needs.
  • Synthetically generable: So that amount, diversity, and storage are not a concern.
  • Visually verifiable: I wanna be able to qualitatively assess the generated outputs.
  • Visually appealing: At least minimally haha.
Figure 2: No, we don’t.

The dataset

After throwing all my creativity at this problem, this is the best I came up with: videos of letters and numbers passing by3, such as the one in Tip 2. Each training datapoint consists of a pair of:

3 And since I’m extremely funny, I decided to waste 5 full minutes of my life making them Comic Sans

  • Video: 16 frames of 32x32 pixels and 1-channel 8-bit data4. Think of this as a tensor of shape \((T, H, W)\)5.
  • Prompt: A combination of the 26 letters of the English alphabet and the 10 digits, of length up to 6 characters. For instance, the associated prompt of Figure 3 is ['L', 'E', 'T', 'T', 'E', 'R', 'S'].

4 Meaning we have 256 levels of brightness / 256 colors.

5 When input to the model we map to floats in the \([-1, 1]\) range.

I made this section collapsible so the moving text isn’t as distracting while reading :)

Figure 3: Letters passing by. Example of the generated video for prompt ['L', 'E', 'T', 'T', 'E', 'R', 'S'].

Implementation considerations:

  • On-the-fly generation: The dataset class generates random combinations of symbols on the fly. I then have a function that takes this combination and converts it into a stack of frames (aka video).

  • Prompt length: Notice prompts can be of different lengths. Long sequences result in text passing by very fast, short sequences in text passing by slower6. The model should be able to learn to count symbols and account for that. It’ll be interesting to see whether it learns to generate the “slow” version of a letter it has only seen “fast” (we test this in Section 3). Note that I provide prompts of fixed length, so there is always padding at the end.

  • No leakage: To ensure we don’t train on test data, I first generate a validation set of \(\approx2000\) combinations with a fixed seed. When generating data for training we discard the points present in our validation set7. The val distribution takes into account the unbalance between the number of combinations of each length8.
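A minimal sketch of this fixed-seed validation set plus rejection sampling (hypothetical names; the real dataset class also balances prompt lengths and renders the frames):

```python
import random

SYMBOLS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def sample_combination(rng, max_len=6):
    """Draw a random prompt: 1 to max_len symbols (uniform over lengths for simplicity)."""
    length = rng.randint(1, max_len)
    return tuple(rng.choice(SYMBOLS) for _ in range(length))

# Validation set generated once with a fixed seed.
val_rng = random.Random(42)
val_set = {sample_combination(val_rng) for _ in range(2000)}

def sample_train_combination(rng):
    """Rejection-sample training prompts so they never overlap the validation set."""
    while True:
        combo = sample_combination(rng)
        if combo not in val_set:
            return combo

train_rng = random.Random(0)
prompt = sample_train_combination(train_rng)
```

Since the prompt space has billions of points, the rejection loop almost never retries in practice.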

6 Because all videos are of fixed 16-frame length.

7 This shouldn’t happen much since most likely the model will converge much much before seeing the 2.2 billion possible combinations. Still, we take testing seriously around here.

8 I also ensured some other combinations are never seen during training to assess the model’s generalization power. More on this in Section 3.

Find the implementations of: dataset class, the sequence sampler and the video utils.

The video space has \(256^{32\times32\times16} = 2^{131,072} \approx 10^{39,456}\) points. Way, way more points than the number of atoms there would be if each atom of our observable universe contained another universe inside \((\approx 10^{6,400})\).

Feels like it should be enough.

Either way, in practice, the bottleneck is going to be the number of prompt combinations we can form with our “vocabulary” of 36 symbols. We have:

\[ \sum_{i=1:6} 36^i \approx 36^6 + 36^5 \approx 2.2 \cdot 10^9 \]

If with 2.2 billion potential examples I can’t make the model learn the shape of 36 symbols I think I can change careers. Anyway, it seems that from a data perspective we should be fine.
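We can sanity-check this count in a couple of lines:

```python
# Number of distinct prompts of length 1..6 over a 36-symbol vocabulary.
total = sum(36**i for i in range(1, 7))
print(total)                      # 2,238,976,116 ~ 2.2e9
print((36**6 + 36**5) / total)    # the two longest lengths dominate the sum
```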

Figure 4: Game of life simulations would have been interesting.

Other dataset ideas I considered include:

  • Game-of-Life kind of systems9. However, I think visually we wouldn’t have been able to verify how accurate the guesses were.

  • Some kind of videos of geometric figures moving around, but I didn’t have a clear idea of how to make it interesting.

  • Physics collision simulation or some complex dynamical systems as I did in this paper re-implementation. To be honest, this was an idea my flatmate gave me after I had already coded the letters thing. That’s why I didn’t do it haha

9 Inspired by this very cool paper

The model

Let’s now review the basics of diffusion modelling, the most interesting parts of the chosen architecture, the training and inference.

I use:

  • \(q(\cdot)\) to denote probability distributions which are known or given (e.g. training data, next-diffusion-step sample)

  • \(p_\theta\) to denote modelled probability distributions by a parametrized function with params \(\theta\) (e.g. the model we are training).

Diffusion basics

Today we’ll be studying DDIM10 / DDPM11. The overall idea is to achieve a mapping between our data distribution \(q\) and a standard Gaussian12 (as with most generative modelling paradigms, such as GANs, VAEs, and Normalizing Flows):

10 Denoising Diffusion Implicit Models

11 Denoising Diffusion Probabilistic Models. DDIM and DDPM are very similar. The main difference is DDIM defines a deterministic reverse step with optional noise (controlled by a hyperparameter), whereas DDPM has an inherently stochastic reverse step. More on this in Section 2.3.

12 Not necessarily this distribution but it is common for the usual reasons.

\[ q \leftrightarrow \mathcal{N} (0, I) \]

In diffusion models, these mappings are achieved through two processes:

  1. Forward diffusion process: \(q \rightarrow\mathcal{N} (0, I)\). It is achieved by iteratively adding Gaussian noise to our original data as we’ll see in Section 2.1.1.
  2. Reverse diffusion process: \(\mathcal{N} (0, I) \rightarrow q\). Trains a model which learns to invert the corruptions we added in the forward process: Section 2.1.2.

The generative aspect is given by the following: If one obtains a model \(f_\theta\) capable of inverting corruptions, we can simply sample from \(z \sim \mathcal{N} (0, I)\), and use the reverse diffusion process to map it to a datapoint of the original distribution: \(f_\theta (z) \sim q\). Thus generating new samples of \(q\).

Forward diffusion process

Ok, how can we map our given data \(q\) to standard Gaussian samples? Given a sample from our original distribution13:

13 You can also see this as picking a datapoint from our original dataset, if fixed: \(\vec{x_0} \in \mathcal{D}\)

\[ \vec{x_0} \sim q \]

We define the sequence \([\vec{x_0}, ..., \vec{x_T}]\) as such14:

14 Aka “Markov noising chain”. “Markov” because each step depends only on the previous one and is independent of everything else: \(q(\vec{x}_t \mid \vec{x}_{0:t-1}) = q(\vec{x}_t \mid \vec{x}_{t-1})\)

\[ \vec{x_t} = \sqrt{\alpha_t} \cdot \underbrace{\vec{x_{t-1}}}_{\text{prev step}} + \sqrt{1 - \alpha_t} \cdot \underbrace{\vec{\epsilon_t}}_{\text{random noise}} \]

Where:

  • \(\vec{\epsilon_t} \sim \mathcal{N}(0, I)\).
  • And \([ \alpha_1 , ..., \alpha_T]\) is a sequence we define15 such that:

15 Usually \(\alpha_1 > \alpha_2 > ... > \alpha_T\) as we wanna be less destructive at the beginning and it doesn’t matter as much in the end. In my implementation, I made \(\alpha\) go from 0.9999 to 0.98 and it worked.

\[\alpha_t \in \mathbb{R}_{[0, 1]} \quad \forall t\]

\[\prod_{t=1:T} \alpha_t \approx 0\]

Conceptually:

  • \(\alpha_t\) tells us “how much of the signal is kept”.
  • \(1 - \alpha_t\) tells us “how big is the noise we add” (variance of the injected noise).
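A single step of this noising chain is one line of NumPy (a sketch; the shapes mimic our \((T, H, W)\) videos):

```python
import numpy as np

def forward_step(x_prev, alpha_t, rng):
    """One step of the noising chain:
    x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps,  eps ~ N(0, I)."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(alpha_t) * x_prev + np.sqrt(1.0 - alpha_t) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((16, 32, 32))   # a (T, H, W) "video" stand-in
x1 = forward_step(x0, alpha_t=0.9999, rng=rng)
```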

It is useful to visualize this operation on 2D points:

Figure 5: Visualization of applying forward diffusion process to samples from an initial distribution of \(q = \mathcal{N} (\mu = [5.0, 7.0], \Sigma = 0.05 \cdot I)\). See how in few steps the samples are mapped to something resembling \(\mathcal{N} (0, I)\). I drew some trajectories to get an idea of how things move.

Our videos, instead of being 2-dim vectors, are vectors of much higher dimension: \(16\times32\times32 = 16,384\). The idea is the same though, Figure 6 shows some steps of the diffusion process for a single frame:

Figure 6: Applying 3 steps of the forward diffusion process to frames of our data: \(\vec{x_0}\) is a frame containing the letter “N”. Figure 1 is a good example of what applying the forward diffusion process to a video looks like.

A nice property of defining the sequence like this is that we can compute \(\vec{x_t}\) directly without iteratively adding noise in multiple steps. This is useful to more efficiently obtain training datapoints at different levels of noise without the need of computing all the intermediate steps.

Notice that:

\[ \begin{split} \vec{x_t} &= \sqrt{\alpha_t} \cdot \vec{x_{t-1}} + \sqrt{1 - \alpha_t} \cdot \vec{\epsilon_{t}}\\ &= \sqrt{\alpha_t} \cdot \underbrace{\left( \sqrt{\alpha_{t-1}} \cdot \vec{x_{t-2}} + \sqrt{1 - \alpha_{t-1}} \cdot \vec{\epsilon_{t-1}} \right)}_{\text{Expression of } x_{t-1}} + \sqrt{1 - \alpha_t} \cdot \vec{\epsilon_{t}}\\ &= \sqrt{\alpha_t \alpha_{t-1}} \cdot \vec{x_{t-2}} + \underbrace{\sqrt{\alpha_t (1 - \alpha_{t-1})} \cdot \vec{\epsilon_{t-1}} }_{\text{Noise A}} + \underbrace{\sqrt{1 - \alpha_t} \cdot \vec{\epsilon_{t}}}_{\text{Noise B}} \end{split} \]

Now we are left with two sources of noise: \(\text{A}, \text{B}\). Both are centered Gaussians with variances: \(\alpha_t (1 - \alpha_{t-1})\) and \((1 - \alpha_t)\) respectively.

Remember that the sum of two independent Gaussians is a Gaussian whose variance is the sum of both variances (and whose mean is the sum of their means; both are zero here). Thus, the variance of \(\text{Noise A} + \text{Noise B}\) is:

\[ \alpha_t (1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1} \]

Which means we can write:

\[ \begin{split} \vec{x_t} = \sqrt{\alpha_t \alpha_{t-1}} \cdot \vec{x_{t-2}} + \sqrt{1 - \alpha_t \alpha_{t-1}} \cdot \vec{\epsilon} \end{split} \]

Intuitively, this makes sense: we are proportionally down-weighting the original signal and up-weighting the added noise; we are just skipping the intermediate step.

Generalizing this we have that:

\[ \vec{x_t} = \sqrt{\prod_{i = 1:t} \alpha_i} \cdot \vec{x_0} + \sqrt{1 - \prod_{i = 1:t} \alpha_i} \cdot \vec{\epsilon} \]

Usually we define and pre-compute the cumulative product: \(\bar{\alpha_t} := \prod_{i = 1:t} \alpha_i\). Getting the expression:

\[ \vec{x_t} = \sqrt{\bar{\alpha_t}} \cdot \vec{x_0} + \sqrt{1 - \bar{\alpha_t}} \cdot \vec{\epsilon} \]
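In code, we precompute the cumulative product once and can then jump to any noise level directly (a sketch using the \(\alpha\) schedule mentioned earlier, 0.9999 to 0.98):

```python
import numpy as np

T = 1000
alphas = np.linspace(0.9999, 0.98, T)
alpha_bar = np.cumprod(alphas)           # precomputed cumulative products

def q_sample(x0, t, rng):
    """Jump straight to x_t: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((16, 32, 32))
x_t, eps = q_sample(x0, t=500, rng=rng)
print(alpha_bar[-1])   # ~0: the final step is (almost) pure noise
```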

If thinking in terms of probability distributions16 we can also see \(\vec{x_t}\) as a sample of a Normal distribution with mean \(\sqrt{\alpha_t} \cdot \vec{x_{t-1}}\) and variance \(1 - \alpha_t\)17:

16 This is the “reverse” of the reparameterization trick.

17 Note: Think of \(\vec{x_t}\) as the flattened vector of pixels. We multiply variance by \(I\) because there is no covariance in the noise we add to the pixels, it’s iid.

\[ q(\vec{x_t} \mid \vec{x_{t-1}}) = \mathcal{N} \left(\sqrt{\alpha_t} \cdot \vec{x_{t-1}}, (1 - \alpha_t) \cdot I \right) \]

Applying Tip 6 it is easy to see how the final distribution is standard Gaussian:

\[ q(\vec{x_T} \mid \vec{x_0}) = \mathcal{N} \left(\sqrt{\bar{\alpha_T}} \cdot \vec{x_0}, (1 - \bar{\alpha_T}) \cdot I \right) \]

Where since \(\bar{\alpha_T} := \prod_{t=1:T} \alpha_t \approx 0\) we have that:

\[ q(\vec{x_T} \mid \vec{x_0}) = \mathcal{N} \left( \sim 0, \sim I \right) \]

I like to imagine this as a “random walk with drift”. At each step we get pulled towards the origin (drift: \(\sqrt{\alpha_t} \cdot \vec{x_{t-1}}\)) and then we get kicked into a random direction (random walk: \(\sqrt{1 - \alpha_t} \cdot \vec{\epsilon_t}\))

But why do we need the drift?

We want the distribution of \(\vec{x_T}\)’s to be standard Gaussian: \(\vec{x_T} \sim \mathcal{N}(0, 1)\). If we didn’t drift we would get a distribution with these characteristics:

  • Mean: The distribution of \(\vec{x_T}\)’s would be centered at the mean of the distribution of \(\vec{x_0}\)’s.
  • Variance: Would increase at each step, as the variance of the sum of Gaussians is the sum of variances.

I was curious and wrote a little simulation with the following parameters:

  • \(\vec{x_t}, \vec{\epsilon}\) are 1-dim vectors.
  • \(\vec{x_0} = 5\) for all samples.
  • \([\alpha_0, \ldots, \alpha_{1000}] = [0.999, \ldots, 0.8]\)
Figure 7: Distribution of \(\vec{x_T}\) if we do the drift, vs we don’t do the drift. Clearly doing the drift brings the final distribution to \(\mathcal{N} (0, 1)\)
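The simulation behind Figure 7 can be reproduced in a few lines (a sketch with the parameters above):

```python
import numpy as np

rng = np.random.default_rng(0)
alphas = np.linspace(0.999, 0.8, 1000)
n = 10_000
x_drift = np.full(n, 5.0)       # x0 = 5 for all samples
x_nodrift = np.full(n, 5.0)

for a in alphas:
    eps = rng.standard_normal(n)
    x_drift = np.sqrt(a) * x_drift + np.sqrt(1 - a) * eps  # pull to origin + kick
    x_nodrift = x_nodrift + np.sqrt(1 - a) * eps           # kick only, no pull

print(x_drift.mean(), x_drift.std())       # close to N(0, 1)
print(x_nodrift.mean(), x_nodrift.std())   # mean stays at 5, variance keeps growing
```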

Note: Sometimes people talk about “betas” instead of “alphas”, which are defined as: \(\beta_t := 1 - \alpha_t\). Everything is the same but the coefficients are “flipped”.

Reverse diffusion process

Given a noisy datapoint \(\vec{x_t}\), our model will learn to “uncorrupt” it. We are essentially learning the inverse of \(q(\vec{x_t} \mid \vec{x_{t-1}} )\):

\[ p_\theta (\vec{x_{t-1}} \mid \vec{x_t}) \]

This is usually achieved by training a model that maps \(\vec{x_t}\) to either \(\vec{x_0}\), \(\epsilon\), or some combination18. In the following section we’ll see how the model is trained.

18 Velocity, for instance, defined as \(\vec{v} := \sqrt{\bar{\alpha}_t} \cdot \vec{\epsilon} - \sqrt{1 - \bar{\alpha}_t} \cdot \vec{x_0}\)

Training

Ok, we are now in the “following section”, time to see how this works in practice.

Truth be told, the \(p_\theta (\vec{x_{t-1}} \mid \vec{x_t})\) formula was a bit of a simplification in this case. More strictly speaking, since we are doing text-conditioned video generation, our model will be something like:

\[ p_\theta (\vec{\epsilon} \mid \vec{x_t}, t, \overrightarrow{\text{prompt}}) \]

Where:

  • \(\vec{x_t} \in \mathbb{R}^{T\times H \times W}\) is the noisy video.
  • \(\vec{\epsilon} \in \mathbb{R}^{T\times H \times W}\) is the noise added to \(\vec{x_0}\).
  • \(t \in [0, ..., T]\) is the diffusion step19
  • \(\overrightarrow{\text{prompt}} \in \mathbb{N}^s\) is a tokenized encoding of the text displayed in the video. A 1-dim tensor of length \(s\).

19 It is beneficial to inform the model of the diffusion stage it is in: early steps are more noisy than later ones.

The training process of our “uncorrupting-model” is presented in Tip 8.

Repeat until convergence:

  1. Our dataset generates a \(\left(\overrightarrow{\text{prompt}}, \vec{x_0} \right)\) pair as explained in Section 1.

  2. We randomly pick a diffusion timestep from \([0, ..., T]\)

  3. We sample \(\vec{\epsilon}\) from a Gaussian with the same shape as \(\vec{x_0}\).

  4. We compute \(\vec{x_t} = \sqrt{\bar{\alpha_t}} \cdot \vec{x_0} + \sqrt{1 - \bar{\alpha_t}} \cdot \vec{\epsilon}\)

  5. We forward to the model: \(\left(\vec{x_t}, t, \overrightarrow{\text{prompt}}\right)\), it returns an \(\hat{\epsilon}\) guess20.

  6. We compute the loss and update the model params: \(l = \text{MSE} \left( \vec{\epsilon}, \hat{\epsilon} \right)\)

Obviously, this is all batched.
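The steps above, with a stub in place of the real transformer (hypothetical names; un-batched for clarity):

```python
import numpy as np

T = 1000
alpha_bar = np.cumprod(np.linspace(0.9999, 0.98, T))

def model(x_t, t, prompt):
    """Stub noise predictor standing in for the transformer of Section 2.4."""
    return np.zeros_like(x_t)

def training_step(x0, prompt, rng):
    t = rng.integers(0, T)                          # 2. random diffusion timestep
    eps = rng.standard_normal(x0.shape)             # 3. sample the noise
    x_t = (np.sqrt(alpha_bar[t]) * x0
           + np.sqrt(1 - alpha_bar[t]) * eps)       # 4. corrupt x0 directly
    eps_hat = model(x_t, t, prompt)                 # 5. predict the noise
    return np.mean((eps - eps_hat) ** 2)            # 6. MSE loss

rng = np.random.default_rng(0)
x0 = rng.standard_normal((16, 32, 32))              # 1. one (prompt, video) pair
loss = training_step(x0, prompt=[11, 4, 19], rng=rng)
```

With the zero-stub the loss is simply \(\text{mean}(\epsilon^2) \approx 1\); a trained model drives it below that.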

20 In Section 2.4 we see how these inputs get internally mapped to the output.

21 Most likely we could have done it much earlier being more efficient and smarter both in architecture design and in the way we sample stuff. Optimizing this, however, wasn’t the point of this project, so I did minimal exploration in this front.

You can check the implemented logic here. When running it, I saw that validation loss reached its minimum after 100k training steps of effective batch size 1024. That is around 100 million samples, at different levels of noise. This means that seeing \(\approx 4.5\%\) of the 2.2 billion combinations of words was enough to generalize to any sequence21.

Inference

Given a \(\overrightarrow{\text{prompt}}\) by the user, and a sample \(\vec{x_T} \sim \mathcal{N} (0, 1)\) of random noise of the video shape, we iteratively remove the noise following these steps for each \(t = T:0\):

  1. Forward \(\left(\vec{x_T}, T, \overrightarrow{\text{prompt}}\right)\) to the model to obtain \(\hat{\epsilon}\)

  2. Use \(\hat{\epsilon}\) to guess \(\vec{x_0}\) (the video we need to generate):

\[ \hat{\vec{x}}_0 = \frac{\vec{x_T} - \sqrt{1 - \bar{\alpha}_T} \cdot \hat{\epsilon}}{\sqrt{\bar{\alpha}_T}} \]

Here one might think “ok, we have \(\hat{\vec{x}}_0\), we are done, let’s go home”. However, this \(\hat{\vec{x}}_0\) potentially has a big error! Let’s see why. Imagine our noise guess deviates slightly from the real one:

\[ \hat{\epsilon} = \vec{\epsilon} + \delta \]

We then have that:

\[ \hat{\vec{x}}_0 = \frac{\vec{x_T} - \sqrt{1 - \bar{\alpha}_T} \cdot (\vec{\epsilon} + \delta)}{\sqrt{\bar{\alpha}_T}} \]

Let’s compute the error we are making, \(\hat{\vec{x}}_0 - \vec{x_0}\):

\[ \hat{\vec{x}}_0 - \vec{x_0} = - \frac{\sqrt{1 - \bar{\alpha}_T}}{\sqrt{\bar{\alpha}_T}} \cdot \delta \]

But remember that \(\bar{\alpha}_T = \prod_{t=1:T} \alpha_t \approx 0\), which means that, even if \(\delta\) is relatively small, \(|\hat{\vec{x}}_0 - \vec{x_0}|\) is big 😱
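We can put numbers on this amplification factor \(\sqrt{1-\bar{\alpha}_T}/\sqrt{\bar{\alpha}_T}\) using the schedule from before (a quick check, not code from the repo):

```python
import numpy as np

# With alphas going from 0.9999 to 0.98 over 1000 steps:
alpha_bar_T = np.prod(np.linspace(0.9999, 0.98, 1000))
amplification = np.sqrt(1 - alpha_bar_T) / np.sqrt(alpha_bar_T)
print(alpha_bar_T)      # ~0: almost no signal left at step T
print(amplification)    # a small delta in eps gets scaled by this factor
```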

Alright, so that \(\hat{\vec{x}}_0\) is (most likely) not a good guess, but hopefully it points in the right direction. Let’s use it to update our current sample using the expression we defined:

\[ \vec{x_{T-1}} = \sqrt{\bar{\alpha}_{T-1}} \cdot \hat{\vec{x}}_0 + \sqrt{1 - \bar{\alpha}_{T-1}} \cdot \hat{\epsilon} \]

Let’s see in what direction we are taking this step wrt \(\vec{x_T}\):

\[ \begin{split} \vec{\Delta x} &= \vec{x_{T-1}} - \vec{x_T}\\ &= \underbrace{\left( \sqrt{\bar{\alpha}_{T-1}} - \sqrt{\bar{\alpha}_{T}} \right)}_{a \approx 0^+} \cdot \hat{\vec{x}}_0 + \underbrace{\left( \sqrt{1- \bar{\alpha}_{T-1}} - \sqrt{1- \bar{\alpha}_{T}} \right)}_{b \approx 0^-} \cdot \hat{\epsilon} \end{split} \]

As we can see, we are:

  1. Moving a tiny bit towards \(\hat{\vec{x}}_0\), a distance defined by:

\[ a := \sqrt{\bar{\alpha}_{T-1}} - \sqrt{\bar{\alpha}_{T}} = \underbrace{\sqrt{\bar{\alpha}_{T-1}}}_{\approx 0^+} \cdot \underbrace{( 1 - \sqrt{\alpha_T})}_{\approx 0^+} \approx 0^+ \]

  2. Slightly removing \(\hat{\epsilon}\) from \(\vec{x_{T}}\), a magnitude given by:

\[ b := \sqrt{1- \bar{\alpha}_{T-1}} - \underbrace{\sqrt{1- \underbrace{\bar{\alpha}_{T-1} \alpha_T}_{\text{Smaller}}}}_{\text{Larger}} \approx 0^- \]

By iteratively applying these transformations we end up converging into a point of our initial data distribution 💃

The update we just described \(\left(\vec{x_{t-1}} = \sqrt{\bar{\alpha}_{t-1}} \cdot \hat{\vec{x}}_0 + \sqrt{1 - \bar{\alpha}_{t-1}} \cdot \hat{\epsilon} \right)\) is deterministic. However, \(p_\theta (\vec{x_{t-1}} \mid \vec{x_t})\) should not be seen as a delta, but as a distribution22. We need to sample from it!

We do so by injecting some new noise at every step. We modulate the amount of fresh randomness with a hyperparameter \(\eta \in \mathbb{R}_{[0, 1]}\), and define:

\[ \sigma_t = \eta\, \sqrt{ \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \left(1-\frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}\right) }. \]

Typically \(\sigma_T > \sigma_{T-1} > ... > \sigma_0 \approx 0\). We use this weighting parameter as follows:

\[ \vec{x_{t-1}} = \sqrt{\bar{\alpha}_{t-1}} \cdot \hat{\vec{x}}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2} \cdot \hat{\epsilon} + \sigma_t \cdot \vec{z}. \]

Where \(\vec{z} \sim \mathcal{N} (0, 1)\).

  • If \(\eta = 0\): We have the deterministic sampler we described before.
  • If \(\eta = 1\): \(\sigma_t\) matches the DDPM posterior standard deviation, recovering stochastic DDPM sampling.

Note that “sampling” the next point instead of greedily choosing the mean has several advantages:

  • You don’t collapse to a single trajectory (mode collapse-ish behavior).
  • You preserve the right amount of variance at each noise level.
  • You get diversity: different \(z\) means different valid outputs.
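Putting the update and the \(\sigma_t\) noise together, a sketch of the full sampling loop (stub model standing in for the trained noise predictor; \(\eta\) interpolates between DDIM and DDPM behaviour):

```python
import numpy as np

T = 1000
alpha_bar = np.cumprod(np.linspace(0.9999, 0.98, T))

def ddim_sample(model, prompt, shape, eta, rng):
    x = rng.standard_normal(shape)                    # x_T ~ N(0, I)
    for t in range(T - 1, 0, -1):
        eps_hat = model(x, t, prompt)
        # 1. Guess x0 from the predicted noise
        x0_hat = (x - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
        # 2. Step towards it, injecting eta-controlled fresh noise
        ab_prev = alpha_bar[t - 1]
        sigma = eta * np.sqrt((1 - ab_prev) / (1 - alpha_bar[t])
                              * (1 - alpha_bar[t] / ab_prev))
        z = rng.standard_normal(shape)
        x = (np.sqrt(ab_prev) * x0_hat
             + np.sqrt(1 - ab_prev - sigma**2) * eps_hat
             + sigma * z)
    return x

def stub_model(x, t, prompt):
    return np.zeros_like(x)   # placeholder for the trained noise predictor

rng = np.random.default_rng(0)
video = ddim_sample(stub_model, prompt=[0], shape=(16, 32, 32), eta=0.0, rng=rng)
```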

22 Even if you have a “perfect” model predicting the noise, there are many plausible “cleaner” samples consistent with \(\vec{x_t}\).

Architecture

Alright, now it’s time to see how the model internally maps the given inputs \(t, \vec{x_t}, \overrightarrow{\text{prompt}}\) to the predicted output: \(\vec{\epsilon_t}\).

Figure 8: General model architecture I implemented. Dark-gray tensors are inputs, light-gray tensors are intermediate steps / outputs. Notice I didn’t draw stuff like initial projections, residual connections, and normalization layers for clarity.

Obviously, as far as architecture possibilities go, the sky is the limit. I decided on this one for simplicity. Plus, since we are dealing with a toy dataset, we can get away without dimensionality-reduction modules and such23.

23 More on those in Section 4.

If interested, here you can read about individual modules of the architecture:

The patcher takes the 3D structure of a video and re-arranges it into a sequential input so it can be processed by the model. It does so by grouping close spatio-temporal blocks (patches) of the given input. These blocks are defined by 3 parameters: \((p_T, p_H, p_W)\)24. It also implements the inverse of this projection so we can reconstruct the original shape after the sequence has been processed by the model.

Figure 9: How the patcher converts a video input into a sequence and back. In this example we can see how a video of shape \((T=6, H=4, W=4)\) is mapped into a sequence of shape \((T^\prime = 16, f = 6)\) ready to be processed by the transformer. The patcher used parameters \((p_T=3, p_H=1, p_W=2)\).

24 The bigger we make them, the coarser the grouping, but the shorter the model’s input sequence (a tradeoff).
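The patching round-trip can be sketched with plain reshapes and transposes (NumPy here; reproducing the Figure 9 example shapes):

```python
import numpy as np

def patchify(video, p_t, p_h, p_w):
    """(T, H, W) -> (num_patches, p_t*p_h*p_w): group spatio-temporal blocks."""
    T, H, W = video.shape
    x = video.reshape(T // p_t, p_t, H // p_h, p_h, W // p_w, p_w)
    x = x.transpose(0, 2, 4, 1, 3, 5)   # move the three block indices up front
    return x.reshape(-1, p_t * p_h * p_w)

def unpatchify(seq, shape, p_t, p_h, p_w):
    """Inverse projection: rebuild the (T, H, W) video from the sequence."""
    T, H, W = shape
    x = seq.reshape(T // p_t, H // p_h, W // p_w, p_t, p_h, p_w)
    x = x.transpose(0, 3, 1, 4, 2, 5)
    return x.reshape(T, H, W)

video = np.arange(6 * 4 * 4).reshape(6, 4, 4)      # the Figure 9 example
seq = patchify(video, p_t=3, p_h=1, p_w=2)         # shape (16, 6)
```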

Since I didn’t want to be bothered with 3D-sinusoidals or 3D-RoPE, I went for learned positional encodings25. Notice we can do this because we meet the following criteria:

  1. The input temporal sequences are relatively short.
  2. The inputs are of constant length.

The main caveat of this choice is that it might be harder for the model to learn at the beginning: it doesn’t have the positional bias built in that the other schemes provide.

25 More on this in my post about positional encoding.

We convert the scalar timestep \(t \in [0, T]\) into a high-dimensional vector using sinusoidal positional embeddings26.

As a reminder, this is the embedding formula at timestep \(t\), coordinate \(i\):

\[ \begin{align} p_t^{(i)} := \begin{cases} \sin({\omega_k} \cdot t), & \text{if}\ i = 2k \\ \cos({\omega_k} \cdot t), & \text{if}\ i = 2k + 1 \end{cases} \end{align} \]

Here is a visualization of how the vector looks at different timesteps:

Code
import plotly.graph_objects as go
import numpy as np

# Parameters
max_timestep = 100
embed_dim = 32  # Must be even

def get_sinusoidal_embedding(t, dim=embed_dim):
    """Compute sinusoidal positional embedding for timestep t."""
    half_dim = dim // 2
    freqs = np.exp(-np.log(10000) * np.arange(half_dim) / half_dim)
    args = t * freqs
    embedding = np.zeros(dim)
    embedding[0::2] = np.sin(args)  # Even indices: sin
    embedding[1::2] = np.cos(args)  # Odd indices: cos
    return embedding

# Build frames for animation (with annotation for timestep label)
frames = []
for t in range(max_timestep + 1):
    emb = get_sinusoidal_embedding(t)
    hover_text = [[f'Dim {i}<br>Value: {emb[i]:.3f}' for i in range(embed_dim)]]
    
    frame_data = [
        go.Heatmap(
            z=[emb],
            colorscale='Reds',
            zmin=-1, zmax=1,
            text=hover_text,
            hovertemplate='%{text}<extra></extra>',
            showscale=False,
            xgap=2, ygap=2
        )
    ]
    frame_layout = dict(
        annotations=[dict(
            text=f"Timestep t = {t}",
            x=0.5, y=-0.4,
            xref="paper", yref="paper",
            showarrow=False,
            font=dict(size=10)
        )]
    )
    frames.append(go.Frame(data=frame_data, layout=frame_layout, name=str(t)))

# Initial state
init_t = 0
init_emb = get_sinusoidal_embedding(init_t)
init_hover = [[f'Dim {i}<br>Value: {init_emb[i]:.3f}' for i in range(embed_dim)]]

fig = go.Figure(
    data=[
        go.Heatmap(
            z=[init_emb],
            colorscale='Reds',
            zmin=-1, zmax=1,
            text=init_hover,
            hovertemplate='%{text}<extra></extra>',
            showscale=False,
            xgap=2, ygap=2
        )
    ],
    frames=frames
)

# Slider for timestep selection
fig.update_layout(
    title=dict(
        text=f"Sinusoidal Positional Embedding (dim={embed_dim})",
        x=0.5,
        font=dict(size=16)
    ),
    xaxis=dict(
        title="",
        tickmode='linear',
        dtick=4,
    ),
    yaxis=dict(
        visible=False,
        scaleanchor='x',
        scaleratio=1.5
    ),
    annotations=[dict(
        text=f"Timestep t = {init_t}",
        x=0.5, y=-0.4,
        xref="paper", yref="paper",
        showarrow=False,
        font=dict(size=10)
    )],
    sliders=[dict(
        active=init_t,
        currentvalue=dict(visible=False),
        pad=dict(t=30, b=10),
        len=0.9,
        x=0.05,
        xanchor="left",
        steps=[
            dict(
                args=[[str(t)], dict(frame=dict(duration=0), mode="immediate")],
                label=str(t),
                method="animate"
            ) for t in range(max_timestep + 1)
        ],
        ticklen=4,
        minorticklen=2,
    )],
    margin=dict(l=40, r=40, t=60, b=120),
    height=260
)

fig.show()
Figure 10

Notice how:

  • High-frequency dimensions (left side) oscillate rapidly: useful for fine-grained distinctions between nearby timesteps.
  • Low-frequency dimensions (right side) change slowly as \(t\) increases: useful for coarse timestep information.

After obtaining this vector, we apply a linear layer so the model uses the information as it better suits it.

26 Same idea as the original Transformer paper positional encodings. Again, I do a deep dive over this in my positional encoding post

Interpretability

Learned positional encodings

I thought it would be interesting to see what the learned positional encodings look like. For this I compute the dot product between all of them27

27 Notice that dot-product is not necessarily a relevant metric in this case, since we are simply adding these as biases to the inputs. Still, it’s enough to see some relationships.

Video positional embeddings

Remember the input to the transformer is flattened to a sequence of \(512\) vectors. We have \(32 \times 32 \times 16\) input pixels grouped into blocks of shape \(4 \times 4 \times 2\), which results into \(8 \times 8 \times 8 = 512\) vectors that need positional encoding.

Let’s first take a look at how similar each encoding is to the one at a given position, for instance \((W=4, H=5, T=3)\). Remember we are working in patched-block coordinates, so we are dealing with a grid of shape \(T=8 \times H=8 \times W=8\).

We observe these similarities:

Code
import plotly.graph_objects as go
import numpy as np

# Load positional embeddings and compute cosine similarity matrix
pos_embed = np.load('pos_embed_video.npz')['pos_embed_video'][0]  # Shape: (512, 128)
pos_embed_norm = pos_embed / np.linalg.norm(pos_embed, axis=1, keepdims=True)
cos_sim = pos_embed_norm @ pos_embed_norm.T  # Shape: (512, 512)

time_size = 8
grid_size = 8
render_threshold = 0.4

def get_cube_data(ref_w, ref_h, ref_t):
    """Generate cube vertex data for a given reference position."""
    # Index formula: t * (H*W) + h * W + w -> reshape gives (T, H, W) order
    ref_idx = ref_t * (grid_size * grid_size) + ref_h * grid_size + ref_w
    similarities = cos_sim[ref_idx].reshape((time_size, grid_size, grid_size))
    
    val_min, val_max = render_threshold, 1.0
    xs, ys, zs, colors, hovers = [], [], [], [], []
    
    for t in range(time_size):
        for h in range(grid_size):
            for w in range(grid_size):
                value = similarities[t, h, w]  # (T, H, W) order from reshape
                if value <= render_threshold:
                    continue
                
                normalized = min(1.0, max(0.0, (value - val_min) / (val_max - val_min)))
                opacity = 0.3 + 0.7 * normalized  # Range: 0.3 to 1.0
                r = int(180 - 130 * normalized)
                g = int(60 + 180 * normalized)
                b = int(220 - 20 * normalized)
                
                xs.append(w)
                ys.append(h)
                zs.append(t)
                colors.append(f'rgba({r}, {g}, {b}, {opacity:.2f})')
                hovers.append(f'(w={w}, h={h}, t={t})<br>Similarity: {value:.3f}')
    
    return xs, ys, zs, colors, hovers

# Build frames for all 512 positions (8 * 8 * 8)
frames = []
for ref_t in range(time_size):
    for ref_h in range(grid_size):
        for ref_w in range(grid_size):
            xs, ys, zs, colors, hovers = get_cube_data(ref_w, ref_h, ref_t)
            
            # Use scatter3d with markers instead of mesh for performance
            frame_data = [
                go.Scatter3d(
                    x=xs, y=ys, z=zs,
                    mode='markers',
                    marker=dict(
                        size=12,
                        color=colors,
                        symbol='square',
                    ),
                    text=hovers,
                    hovertemplate='%{text}<extra></extra>',
                    showlegend=False
                ),
                # Reference marker
                go.Scatter3d(
                    x=[ref_w], y=[ref_h], z=[ref_t],
                    mode='markers',
                    marker=dict(size=8, color='gold', symbol='diamond',
                               line=dict(color='darkgoldenrod', width=2)),
                    hovertemplate=f'Reference (w={ref_w}, h={ref_h}, t={ref_t})<extra></extra>',
                    showlegend=False
                )
            ]
            frames.append(go.Frame(data=frame_data, name=f'{ref_w}_{ref_h}_{ref_t}'))

# Initial view
init_w, init_h, init_t = 4, 5, 3
xs, ys, zs, colors, hovers = get_cube_data(init_w, init_h, init_t)

fig = go.Figure(
    data=[
        go.Scatter3d(
            x=xs, y=ys, z=zs,
            mode='markers',
            marker=dict(size=12, color=colors, symbol='square'),
            text=hovers,
            hovertemplate='%{text}<extra></extra>',
            showlegend=False
        ),
        go.Scatter3d(
            x=[init_w], y=[init_h], z=[init_t],
            mode='markers',
            marker=dict(size=8, color='gold', symbol='diamond',
                       line=dict(color='darkgoldenrod', width=2)),
            name='Reference'
        ),
        # Colorbar reference
        go.Scatter3d(
            x=[None], y=[None], z=[None], mode='markers',
            marker=dict(size=0.1, color=[0],
                       colorscale=[[0, 'rgb(180, 60, 220)'], [1, 'rgb(50, 240, 200)']],
                       cmin=render_threshold, cmax=1.0,
                       colorbar=dict(title='Sim', thickness=12, len=0.5)),
            showlegend=False, hoverinfo='skip'
        )
    ],
    frames=frames
)

# Create three sliders - one for each dimension
fig.update_layout(
    title=dict(text="Positional Embedding Similarities", x=0.5),
    scene=dict(
        xaxis_title='W (width)',
        yaxis_title='H (height)',
        zaxis_title='T (time)',
        xaxis=dict(tickvals=list(range(grid_size)), range=[-0.5, grid_size-0.5]),
        yaxis=dict(tickvals=list(range(grid_size)), range=[-0.5, grid_size-0.5]),
        zaxis=dict(tickvals=list(range(time_size)), range=[-0.5, time_size-0.5]),
        aspectmode='cube',
        camera=dict(eye=dict(x=1.6, y=1.6, z=1.0))
    ),
    sliders=[
        dict(
            active=init_w, currentvalue={"prefix": "W: ", "font": {"size": 14}},
            pad={"t": 40}, len=0.25, x=0.05, xanchor="left",
            steps=[dict(args=[[f'{w}_{init_h}_{init_t}'], {"frame": {"duration": 0}, "mode": "immediate"}],
                       label=str(w), method="animate") for w in range(grid_size)]
        ),
        dict(
            active=init_h, currentvalue={"prefix": "H: ", "font": {"size": 14}},
            pad={"t": 40}, len=0.25, x=0.38, xanchor="left",
            steps=[dict(args=[[f'{init_w}_{h}_{init_t}'], {"frame": {"duration": 0}, "mode": "immediate"}],
                       label=str(h), method="animate") for h in range(grid_size)]
        ),
        dict(
            active=init_t, currentvalue={"prefix": "T: ", "font": {"size": 14}},
            pad={"t": 40}, len=0.25, x=0.71, xanchor="left",
            steps=[dict(args=[[f'{init_w}_{init_h}_{t}'], {"frame": {"duration": 0}, "mode": "immediate"}],
                       label=str(t), method="animate") for t in range(time_size)]
        ),
    ],
    margin=dict(l=0, r=0, t=50, b=200),
    legend=dict(x=0.85, y=0.95)
)

fig.show()
Figure 11

We can see how the model effectively learned the space-time structure of the problem by observing how similar the embeddings are28:

28 I know it’s obvious, but it’s still quite impressive how this emerged simply from looking at examples of letters moving.

  1. Time: Frames close in time share similar embeddings, and the similarity fades as temporal distance grows. Within the same frame, similarity is very high.
  2. Space: Within a given frame, blocks that are spatially closer have more similar embeddings, implying the model learned the concept of “spatial distance” in this setup.

I went a bit extra with the 3D visualization; it may be easier to see the flattened all-to-all similarities:

Figure 12: Unraveled all-to-all similarities. Here we can see how “similar” each of the \(512\) positional encodings is to all the others.

We can derive the same conclusions:

  1. We can clearly see the 8 groups of 64 vectors corresponding to each frame: same-frame vectors have similar encodings.
  2. The encodings of the first and last (couple of) frames are very similar: in our data, the first and last frames are usually black.
  3. The closer two frames are in time, the more similar their encodings. This shows up as a strong diagonal band that fades the further we move from it.
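For reference, this all-to-all similarity map boils down to a row-normalization and one matrix product. A minimal sketch (using a random stand-in for the trained embedding table, since the real weights live in the repo):

```python
import numpy as np

# Hypothetical stand-in: the learned positional-embedding table has
# shape (T*H*W, d_model) = (8*16*16, d_model), flattened frame by frame.
rng = np.random.default_rng(0)
pos_emb = rng.normal(size=(512, 64))

# All-to-all cosine similarity: normalize rows, then one matmul.
normed = pos_emb / np.linalg.norm(pos_emb, axis=1, keepdims=True)
sim = normed @ normed.T  # (512, 512), entries in [-1, 1]
```

`sim` is exactly what Figure 12 displays (e.g. via `imshow`); the trained embeddings produce the frame-block and diagonal structure described above, while this random stand-in would not.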

Prompt positional encodings

Similarly, I was also curious about the learned prompt positional embeddings. I computed their cosine similarities too and saw a healthy-looking diagonal trend :)

Figure 13: Prompt positional embeddings all-to-all cosine similarities. Notice that we never trained on sequences longer than 6 elements, that’s why the last two remain “neutral”.

Experiments

This section is built with a question-answer format of random things that popped into my head while working on this project.

  1. Can the model generate never-seen sequences?

Yes it can! Remember that we made sure the words never started with the char A and never had the char S in the second position. Generating something as “unusual” as the word AS is no problem.

Figure 14: No problem whatsoever generating AS.
  1. Can a model that has never seen single-letters generate decent single-letters?

It struggles quite a bit:

(a) Generation of input A. This model had never seen an A at the first position (on top of never seen 1-char prompts).
(b) Generation of input E.
Figure 15: Generation of single letters by a model that has never seen prompts of length 1.
  1. How does it look if we make it write a letter outside its vocabulary?

I thought it’d be interesting to give the model extra vocabulary tokens, even if they were never trained on. This is what happens when conditioning on something it has never seen:

(a) Generation of input WoRD
(b) Expectation of input WoRD
Figure 16: Making it invent a new letter.
  1. Does the model generalize to longer-than-training sequences?

During training, only sequences of up to 6 chars were provided. However, I left some extra empty positions in the prompt to test generalization. This was a bit underwhelming: it simply ignored everything beyond the 6th position 🤷‍♀️ It is to be expected, as those positions were always padding in the training set.

Conclusions

Phew 😮‍💨 what started as “let’s see if I can make this work” ended up taking way more time than I expected (as per usual 😅). And, admittedly, I cut many things I had planned to explore.

Ideas for future work

I’ll list some of the ideas I had just to have them logged:

  1. In interpretability: I think it’d have been cool to also:
    • See the patterns in self-attn and cross-attn given a prompt.
    • See how the model “stores” each shape in the MLPs.
  2. In inference: I’d have liked to tie the reverse step more tightly to its probabilistic interpretation.
  3. In diffusion basics: Would have been nice to further explore the connection between this and other generative modelling paradigms: e.g. normalizing flows and VAEs29.
  4. Address a possible criticism: Given a prompt, the target video is unique (the dataset is deterministic). It would have been interesting to randomize things like text direction, font, size, and speed, to see the effect on learning and whether the model actually generates those variations.

29 Maybe on a future post? Who knows… Not even me, stay tuned tho!

Serious diffusion-based video generation

Anyway, before wrapping up, I think it’d be interesting to see how one would go about solving a more serious version of this problem. Here I leave some things people do:

  1. Latent diffusion: Instead of diffusing raw pixels, you train an autoencoder (VAE / VQ-ish) and then diffuse in a compressed latent space.

  2. Proper prompt embeddings: Instead of computing embeddings with a simple lookup table, one could use a more sophisticated model like a T5 or a CLIP-like model.

  3. Cascaded generation: Generate at low resolution and have an up-sampling model refine it. Or generate fewer frames first and then complete to the desired length30.

  4. Space-time factorization: Split attention into separate spatial and temporal blocks to reduce compute and bias the network toward temporal consistency.

  5. Other stuff: CFG (classifier-free guidance), better/faster samplers (instead of going through all the training steps), more principled noise schedules (instead of “these \(\alpha\)’s look good to me”), etc.

30 This way we split the “global structure” and the “fine details” into two problems.
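Of the “other stuff”, classifier-free guidance is simple enough to sketch in a few lines: at sampling time you run the denoiser twice, once with the prompt and once with a “null” prompt, and push the prediction in the conditional direction. A minimal sketch, assuming a noise-prediction model with the (hypothetical) signature `model(x_t, t, cond)`:

```python
import numpy as np

def cfg_noise_pred(model, x_t, t, cond, null_cond, w=3.0):
    """Classifier-free guidance: blend conditional and unconditional
    noise predictions, pushing away from unconditional by weight w."""
    eps_cond = model(x_t, t, cond)        # prediction given the prompt
    eps_uncond = model(x_t, t, null_cond) # prediction given a null prompt
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy "model" for illustration: noise prediction is x_t scaled by the
# conditioning value (not a real denoiser, just to exercise the blend).
toy_model = lambda x_t, t, c: c * x_t

x = np.ones(4)
out = cfg_noise_pred(toy_model, x, t=0, cond=2.0, null_cond=1.0, w=3.0)
# eps_uncond = 1, eps_cond = 2 -> 1 + 3 * (2 - 1) = 4 per element
```

With `w=0` you recover unconditional sampling and with `w=1` plain conditional sampling; larger `w` trades sample diversity for prompt adherence.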

In this toy setup, I could get away without most of these because the world is tiny and the task is super structured. But the moment you want higher-res, longer clips, richer motion, and “real” prompts… you’ll want the full toolbox. Hope this was fun and informative! 😁