Starred Posts ⭐

Linear Transformers, Mamba2, and many ramblings
I go through the architectures used in sequence modelling: FFNs, CNNs, RNNs, SSMs, and Transformers, along with many attempts at optimizing their efficiency. I provide an intuitive understanding of how they work and analyze their strengths and weaknesses, all while paying special attention (pun intended) to their computational and memory complexities.
Understanding Transformers

Dot-product attention enhancements: MHA, MQA, GQA, and MHLA
ML Basics
Miscellaneous