I go through many core topics of ANN optimization: Gradient descent, the chain rule, backpropagation and computational graphs.
ML Basics
Author
Oleguer Canal
Published
December 28, 2025
Today we will be looking at a very fundamental1 topic in deep learning: how gradients are computed in a library such as PyTorch. Roughly, I wanna cover this stuff:
1 Because you gotta be mental not to enjoy this.
From analytical optimization to gradient descent.
From function optimization to model fitting through Maximum Likelihood Estimation.
The chain rule and backpropagation
Building a tiny autodiff engine to make sure we understand everything2. We will use it to train a logistic regressor for a toy classification problem.
2 I just saw that Karpathy did this 5 years ago 😅 Here is the source if you’re interested in how to do this better.
Introduction
Aaah optimization 😌 What isn’t there to love about making everything bigger, or smaller, or faster, or slower, or just… better.
Given:
\[
f: \mathbb{R}^n \rightarrow \mathbb{R}
\]
We would die for:
\[
\min_x \space f(x)
\]
Tip 2: The high-school way
In high school we’d be given a function such as:
\[
f(x) = 2 x^2 + 3 x + 5
\]
And asked to find the \(x\) at which it attains its minimum. We’d solve this analytically by computing the derivative:
\[
\frac{\partial f(x)}{\partial x} = 4 x + 3
\]
And solve for \(x\) such that \(\frac{\partial f}{\partial x} (x) = 0\).
In this case:
\[
4 x + 3 = 0
\]
\[
x = -\frac{3}{4}
\]
Visualized:
Code
```python
import plotly.express as px
import numpy as np

x = np.arange(-2, 2, 0.1)
y = 2 * x * x + 3 * x + 5
y_prime = 4 * x + 3  # Corrected derivative calculation

# Combine both lines into one figure
fig = px.line()
fig.add_scatter(x=x, y=y, mode='lines', name='f(x): 2x^2 + 3x + 5', line=dict(color='blue'))
fig.add_scatter(x=x, y=y_prime, mode='lines', name="f'(x): 4x + 3", line=dict(color='red'))

# Find x where derivative crosses zero
x_zero = -3 / 4

# Update layout for better visualization
fig.update_layout(
    width=500, height=400,
    xaxis=dict(title='x', zeroline=True, zerolinewidth=2, zerolinecolor='black'),
    yaxis=dict(title='y', zeroline=True, zerolinewidth=2, zerolinecolor='black'),
    # legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01)
)

# Add a vertical line where the derivative crosses zero
y_at_minimum = 2 * x_zero * x_zero + 3 * x_zero + 5
fig.add_shape(
    type="line",
    x0=x_zero, y0=0, x1=x_zero, y1=y_at_minimum,
    line=dict(color="green", width=2, dash="dashdot"),
)

fig.show()
```
Life as an adult isn’t as beautiful though: real-life functions can be too complex to obtain the analytical form of their derivatives over their whole domain4. Still, if we can at least obtain the gradient locally, we can find the minimum using other methods. Let’s give a round of applause to the gradient descent algorithm.
4 Just consider the hotdog-or-not function: a function that tells you whether a picture is a hot dog or not-a-hot-dog 🌭. It is extremely convoluted (pun intended) .
Gradient descent
Gradient descent is an iterative method to optimize a differentiable function5. The main observation is the following:
5 In deep learning (in practice) differentiability in a strict sense is not taken super seriously.
If we take any point \(a\) from \(f\)’s domain, and we move “slightly” in the opposite direction of \(f\)’s gradient at \(a\), \(f\) will decrease:
\[
f(a - \eta \cdot \nabla f(a)) < f(a)
\]
Where \(\eta\) is a small scalar. We can iteratively apply this idea to move a random initial guess towards a local optimum6:
6 Obviously, there is a lot of stuff to comment about global vs local optima and convergence guarantees and their mother. I won’t cover it here to limit a bit the scope of this.
\[
\begin{aligned}
&\text{Randomly initialize } x_0 \\
&\text{Repeat until convergence:} \\
&\hspace{1em} \text{Find the next point:} \ x_{k+1} = x_k - \eta \cdot \nabla_x f (x_k) \\
\end{aligned}
\]
The main advantage here is that we don’t need to compute the analytical form of the gradient: We only need it locally for every point we evaluate. 🥳
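The loop above can be sketched in a few lines of plain Python. This is a minimal, standalone sketch (function names and the fixed step count are my choices), reusing the high-school derivative \(f'(x) = 4x + 3\) as the gradient oracle:

```python
# Minimal gradient descent: all we need is a callable that returns
# the gradient of f at any point we ask about.
def gradient_descent(grad_f, x0: float, lr: float = 0.4, steps: int = 100) -> float:
    x = x0
    for _ in range(steps):
        x = x - lr * grad_f(x)  # move slightly against the gradient
    return x

# Our high-school example: f(x) = 2x^2 + 3x + 5, so f'(x) = 4x + 3
x_star = gradient_descent(lambda x: 4 * x + 3, x0=0.0)
print(x_star)  # converges towards -3/4
```

Note that we never needed the closed-form solution \(x^* = -\frac{3}{4}\); only local gradient evaluations.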
Note 3: Why does the gradient point in the direction of steepest ascent?
This question popped into my head when writing this. It is not so obvious, right? It’s actually not that hard, so let’s get into it. Consider the following function:
\[
f : \mathbb{R}^n \rightarrow \mathbb{R}
\]
If \(\vec{h}\) is small enough, we have that the first-order Taylor expansion of \(f\) around \(\vec{a}\) is:
\[
f(\vec{a} + \vec{h}) \approx f(\vec{a}) + \nabla f(\vec{a}) \cdot \vec{h}
\]
This is a bit hand-wavy 👋, but the reasoning goes like this:
\(\nabla f(\vec{a}) \cdot \vec{h}\) is maximized when \(\vec{h}\) is in the direction of \(\nabla f(\vec{a})\) (Cauchy-Schwarz inequality).
If we want to maximize \(f(\vec{a} + \vec{h})\), \(\vec{h}\) needs to be in the direction of \(\nabla f(\vec{a})\).
Therefore, the direction of maximum ascent of \(f\) from \(a\) is the direction of \(\nabla f(\vec{a})\).
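We can sanity-check this numerically: for a toy \(f\) and a small step \(h\), no random unit direction should beat the gradient direction. A sketch (the particular \(f\), seed, and tolerance are arbitrary choices of mine):

```python
import numpy as np

def f(p):
    # toy function f(x, y) = x^2 + 3y, with gradient (2x, 3)
    return p[0] ** 2 + 3 * p[1]

a = np.array([1.0, 1.0])
grad = np.array([2 * a[0], 3.0])
grad_dir = grad / np.linalg.norm(grad)
h = 1e-4  # "small enough" step

# increase of f along the gradient direction
grad_inc = f(a + h * grad_dir) - f(a)

# best increase over many random unit directions
rng = np.random.default_rng(0)
best = -np.inf
for _ in range(1000):
    d = rng.normal(size=2)
    d /= np.linalg.norm(d)
    best = max(best, f(a + h * d) - f(a))
```

After running this, `grad_inc` matches or exceeds `best` (up to second-order terms of size \(O(h^2)\)), which is exactly the Cauchy-Schwarz argument in numbers.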
Tip 4: Gradient descent on your high-school problem
Ok, let’s see how we would apply gradient descent to our previous problem (Tip 2).
We would start with a random guess of the optimal point, \(x_0\). Let’s pick \(x_0 = 0\) for simplicity, set the learning rate to \(\eta = 0.4\), and run it for \(5\) iterations.
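Running the update rule by hand (same \(f'(x) = 4x + 3\) as before):

```python
def f_prime(x: float) -> float:
    return 4 * x + 3

eta = 0.4
x = 0.0  # x_0
history = [x]
for _ in range(5):
    x = x - eta * f_prime(x)  # x_{k+1} = x_k - eta * f'(x_k)
    history.append(x)

print(history)  # x_0 .. x_5, ending near -0.80832
```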
In 5 steps we got to the point \(x_5 = -0.80832\), which is much closer to the optimal point \(x^* = -\frac{3}{4} = -0.75\). Visualized:
Code
```python
import plotly.express as px
import plotly.graph_objects as go
import numpy as np

# Define functions
def f(x):
    return 2 * x * x + 3 * x + 5

def f_prime(x):
    return 4 * x + 3

# Perform gradient descent
x_0 = 0
x_1 = x_0 - 0.4 * f_prime(x_0)
x_2 = x_1 - 0.4 * f_prime(x_1)
x_3 = x_2 - 0.4 * f_prime(x_2)
x_4 = x_3 - 0.4 * f_prime(x_3)
x_5 = x_4 - 0.4 * f_prime(x_4)

x = np.arange(-2, 2, 0.1)
y = f(x)
y_prime = f_prime(x)

fig = px.line(x=x, y=y)
fig.add_scatter(x=x, y=y_prime, mode='lines', showlegend=False)

# Add labels to each point showing which is which (x_0, x_1, ..., x_5)
x_vals = [x_0, x_1, x_2, x_3, x_4, x_5]
y_vals = [f(x_0), f(x_1), f(x_2), f(x_3), f(x_4), f(x_5)]
labels = [str(i) for i in range(6)]
fig.add_scatter(x=x_vals, y=y_vals, mode='markers+text', text=labels,
                textposition='top right', showlegend=False)

# Add arrows indicating gradient direction
for i in range(len(x_vals) - 1):
    fig.add_annotation(
        x=(x_vals[i + 1] + x_vals[i]) / 2, y=y_vals[i],
        ax=x_vals[i], ay=y_vals[i],
        xref='x', yref='y', axref='x', ayref='y',
        arrowhead=1, arrowsize=1, arrowwidth=1, arrowcolor='red',
        showarrow=True,
    )

fig.show()
```
Model optimization
Finding the minimum of a function is cool and all, but that’s not what we came for, is it? Things get a bit more meta when we start talking about fitting models.
More specifically, we are usually given a dataset7:
\[
\mathcal{D} = \{ (x_i, y_i) \}_{i=1}^{N}
\]
7 We’ll focus on parametric discriminative supervised learning
And we are asked to model this data8. For this, we assume some functional form9 depending on some parameters \(\theta\) such as:
8 Aka learn the mapping between \(x_i\) and \(y_i\)
9 Notice this is a source of bias! We are choosing a function based on our “expertise” which could lead into a bad model. Common errors are choosing an under-parametrized function (causing under-fitting), choosing an over-parametrized or high-variance-susceptible function (potentially causing over-fitting), or simply choosing a function unable of representing \(\mathcal{D}\).
\[
f_{\theta} (x) = \text{whatever you want}
\]
And also choose some loss function \(\mathcal{L} : \mathbb{R}^{|\theta | } \rightarrow \mathbb{R}\):
\[
\mathcal{L} (f_{\theta}, \mathcal{D}) = \text{also whatever you want}
\]
It is important to notice that we are no longer optimizing wrt the function input variable \(x\): we are optimizing wrt the parameters \(\theta\). This is our new variable (all the \(x\)’s are actually fixed, given by the dataset). Likewise, we are no longer finding a minimum of \(f\), but a minimum of our loss \(\mathcal{L}\)11. To make it more similar to the previous notation, we can write it like so:
\[
\min_\theta \space \mathcal{L} (f_{\theta}, \mathcal{D})
\]
11 Which represents a notion of “how bad \(f\) is at modeling the data”
Ok, amazing! But… how does gradient descent fit into this, you ask? Same as before, I answer! But searching within the parameter-space instead of the input-space. Given some learning rate \(\eta\):
\[
\begin{aligned}
&\text{Randomly initialize } \theta_0 \\
&\text{Repeat until convergence:} \\
&\hspace{1em} \text{Find the next point:} \ \theta_{k+1} = \theta_k - \eta \cdot \nabla_\theta \mathcal{L} (f_{\theta_k}, \mathcal{D}) \\
\end{aligned}
\]
Ok, not so bad. But wait… computing all parameter gradients \(\nabla_\theta \mathcal{L}(\hat{y}_k, y)\) might not be so straightforward, right? After all, \(f\) can be extremely complex.
Writing an autograd engine
Before we get into the thick of it, let’s define a few things.
A computational graph is a directed acyclic graph (DAG) that encodes how a computation is built from primitive operations. This graph has two types of nodes:
Data nodes: nodes that represent the data to be operated on.
Operation nodes: nodes that represent the operations to be performed.
Figure 1: Computational graph of the function \(y = w \cdot x + b\). Here I represented data nodes with 🟩 and operation nodes as 🔵.
Crucially, each data node has a pointer to the operation that created it (unless it is a “leaf node” 🍂). Similarly, each operation node has pointers to the data nodes that were fed into it.
The use of this graph is twofold:
Forward pass: We sequentially compute the values of the data nodes in the graph (from left to right \(\rightarrow\) in the picture), applying the operations described by the operation nodes.
Backward pass: We backpropagate the gradients (from right to left \(\leftarrow\)) through the graph, computing the gradient of the loss with respect to each data node. For this we make use of the chain rule.
Note 5: The chain rule reminder & backpropagation
Turns out that the derivative of a composition of functions is the product of the derivatives of the functions:
\[
\frac{\partial \left[ g \circ f \right] }{\partial x} (x) =
\frac{\partial g }{\partial f} \left( f(x) \right)
\cdot
\frac{\partial f }{\partial x} (x)
\]
In the computational graph of Figure 1, we could write:
\[
\frac{\partial y}{\partial w} = \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w}
\]
\(\frac{\partial y}{\partial z}\) tells us how much the output \(y\) would change if we changed \(z\) by a unit (an infinitesimal unit). Similarly, \(\frac{\partial z}{\partial w}\) tells us how much \(z\) would change if we changed \(w\) by a unit. Thus, the product of these two tells us how much \(y\) would change if we changed \(w\) by a unit.
Putting some random numbers:
If, for instance, \(\frac{\partial y}{\partial z} = 3\), this means that if \(z\) had been a unit larger, \(y\) would have been \(3\) units larger.
In parallel, if \(\frac{\partial z}{\partial w} = -2\), this means that if \(w\) had been a unit larger, \(z\) would have been \(2\) units smaller.
Thus, if \(w\) had been a unit larger, \(y\) would have been \(3 \cdot (-2) = -6\) units larger, i.e. \(6\) units smaller.
If we were aiming to minimize \(y\), we would want to update \(w\) by making it larger. In addition, we know that the magnitude of \(w\)’s effect on \(y\) is \(6\), so our optimizer will scale the update by this factor.
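We can check the toy numbers above with a finite difference. The two linear functions are chosen (by me) so that \(\frac{\partial y}{\partial z} = 3\) and \(\frac{\partial z}{\partial w} = -2\) exactly:

```python
def z_of(w: float) -> float:
    return -2.0 * w   # so dz/dw = -2

def y_of(z: float) -> float:
    return 3.0 * z    # so dy/dz = 3

w, h = 1.0, 1e-6
# finite-difference estimate of dy/dw through the composition y(z(w))
dy_dw = (y_of(z_of(w + h)) - y_of(z_of(w))) / h
print(dy_dw)  # ≈ 3 * (-2) = -6, as the chain rule predicts
```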
Note 6: Adversarial attacks and deep dreams
Notice that we could obtain the gradient of the loss wrt the input, and perturb the input in that direction:
\[
x^\prime = x + \epsilon \cdot \nabla_x \mathcal{L}
\]
Where \(\epsilon\) is a small scalar. This small perturbation of the input is in the direction which maximizes the loss. I.e. we are changing the input such that the model is more likely to output something wrong.
Deep dreams
Alternatively, we can be extra funny and modify the input such that the loss is minimized, like so:
\[
x^\prime = x - \epsilon \cdot \nabla_x \mathcal{L}
\]
Usually we do this multiple times to ensure we see something interesting. Disclaimer: I oversimplified this a lot just to grasp the general idea. In practice, when doing this, one chooses some internal feature to maximize (instead of the target loss). I go into much more detail in my post on Deep dreams.
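A toy numeric illustration of both tricks, using a simple quadratic stand-in for the loss (not a real model; the function and \(\epsilon\) are arbitrary choices of mine):

```python
def loss(x: float) -> float:
    return (x - 2.0) ** 2      # pretend "loss", minimized at x = 2

def grad(x: float) -> float:
    return 2.0 * (x - 2.0)     # analytic gradient of the loss

x, eps = 0.5, 0.1
x_adv = x + eps * grad(x)      # "adversarial": step up the loss
x_dream = x - eps * grad(x)    # "deep dream": step down the loss
```

After one step, `loss(x_adv)` is larger than `loss(x)` and `loss(x_dream)` is smaller, which is the whole idea in miniature.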
Ok, with this out of the way, let’s get to work. The plan is more or less the following:
We will create a Tensor class, which will be the data nodes of our computational graph.
We will create the Operation class, which will be the operation nodes of our computational graph. Then we’ll implement a bunch of operations like multiplication, addition, logarithm…
We test the engine by training a small model on some toy example we design.
The Tensor class
First things first, let’s define the data that our “data node” holds.
```python
class Tensor:
    def __init__(self, data: float, requires_grad: bool = False) -> None:
        self.data = data                    # the actual data (only python float supported in this toy example)
        self.requires_grad = requires_grad  # whether we need to compute gradients for this node
        self.grad = None                    # gradient in that node
        self.grad_fn = None                 # pointer to the operation that produced this tensor
```
Piece of 🍰. Let’s now add the backprop function. We will split it into 2 functions:
_get_topological_sorted_graph: When called on tensor \(t\), this gives us a flattened list of all the tensors that were involved in the computation of \(t\). The result is given in topological order which means that for each directed edge \(e: u \mapsto v\), \(u\) will appear before \(v\) in the resulting list12.
_backprop_grads: This function traverses the previous graph in reverse order (from newest to oldest tensor), passing the gradients from one tensor to the next. The backpropagated gradients get modified each time they traverse an “operation node”, according to the nature of that operation.
12 Aka “things are in order of appearance”
For now we will assume that the Operation class is implemented, that we have a function that tells us how that operation alters the incoming gradients, and that we have access to the Tensors that were inputs to that operation instance (aka its parents).
```python
from typing import List, Set


class Tensor:
    def __init__(self, data: float, requires_grad: bool = False) -> None:
        self.data = data
        self.requires_grad = requires_grad
        self.grad = None
        self.grad_fn = None

    def _get_topological_sorted_graph(self) -> List["Tensor"]:
        """
        Return list of tensors in the computational graph up until self.
        The list is sorted in topological order
        (aka from left to right, aka from oldest to newest)
        """
        topo: List[Tensor] = []
        visited: Set[int] = set()

        def build_topo(t: Tensor) -> None:
            tid = id(t)
            if tid in visited:
                return
            visited.add(tid)
            if t.grad_fn is None:
                return
            for p in t.grad_fn.parents:
                build_topo(p)
            topo.append(t)

        build_topo(self)
        return topo

    def _backprop_grads(self, topo: List["Tensor"]) -> None:
        """
        Walk graph in reverse topological order
        i.e. from right to left, pushing gradients backwards
        """
        for t in reversed(topo):
            if t.grad_fn is None:
                continue  # leaf tensor; nothing to propagate further
            if t.grad is None:
                continue  # no upstream flow reached this node
            # Local backward. Gradients of: ∂ operation / ∂ inputs
            parent_grads = t.grad_fn.backward(t.grad)
            for parent, local_grad in zip(t.grad_fn.parents, parent_grads):
                if local_grad is None:
                    continue
                if parent.requires_grad:
                    parent.grad = (parent.grad or 0.0) + float(local_grad)

    def backward(self, grad: float = 1.0) -> None:
        if not self.requires_grad:
            return
        # Get sorted list of tensors
        topo = self._get_topological_sorted_graph()
        # Seed gradient on the output
        self.grad = (self.grad or 0.0) + float(grad)
        # Backprop gradients
        self._backprop_grads(topo)
```
The Operation class
Amazing! Let’s now write an abstract class for a generic operation. The forward and backward methods will need to be implemented by each operation type:
forward method: performs the operation on the input tensors.
backward method: defines how the upstream gradient gets distributed to each input.
We’ll also create a class method for convenience which instantiates the operation, runs its forward pass, and wires the resulting output tensor into the graph:
```python
import abc


class Operation(abc.ABC):
    def __init__(self, *parents):
        self.parents = parents  # pointers to the tensors that are the inputs to this operation

    @abc.abstractmethod
    def forward(self, *inputs):
        # What the operation does to the inputs
        pass

    @abc.abstractmethod
    def backward(self, *grads):
        # What the operation does to the gradients
        pass

    @classmethod
    def __call__(cls, *parents):
        operation = cls(*parents)                              # instantiate the operation node
        out_data = operation.forward(*parents)                 # compute the output (forward reads .data itself)
        requires_grad = any(p.requires_grad for p in parents)  # check if output requires grad
        out = Tensor(out_data, requires_grad=requires_grad)    # create the output tensor
        if requires_grad:
            out.grad_fn = operation  # store the operation that produced this tensor
        return out
```
Let’s implement some operations!
Given the abstract class, we are now ready to implement operations like churros! For now let’s get some of the basics: Multiplication, Addition, Logarithm and Sigmoid.
```python
class Multiplication(Operation):
    """
    z = a * b
    dz/da = b
    dz/db = a
    """
    def forward(self, a: Tensor, b: Tensor) -> float:
        return a.data * b.data

    def backward(self, upstream_grad: float) -> Tuple[float, float]:
        a, b = self.parents
        dz_da = b.data * upstream_grad
        dz_db = a.data * upstream_grad
        return (dz_da, dz_db)
```
```python
class Addition(Operation):
    """
    z = a + b
    dz/da = 1
    dz/db = 1
    """
    def forward(self, a: Tensor, b: Tensor) -> float:
        return a.data + b.data

    def backward(self, upstream_grad: float) -> Tuple[float, float]:
        a, b = self.parents
        dz_da = 1. * upstream_grad
        dz_db = 1. * upstream_grad
        return (dz_da, dz_db)
```
```python
class Log(Operation):
    """
    f(x) = log(x)
    df/dx = 1 / x
    """
    def forward(self, x: Tensor) -> float:
        return math.log(x.data)

    def backward(self, upstream_grad: float) -> Tuple[float]:
        x = self.parents[0]
        df_dx = 1.0 / x.data
        return (upstream_grad * df_dx,)
```
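The list above also promised a Sigmoid. Following the same pattern, it could look like this (shown standalone so it runs on its own; in the engine it would subclass Operation, whose `__init__` just stores the parents):

```python
import math
from typing import Tuple


class Sigmoid:  # would be `class Sigmoid(Operation):` in the engine above
    """
    f(x) = 1 / (1 + exp(-x))
    df/dx = f(x) * (1 - f(x))
    """
    def __init__(self, *parents):
        self.parents = parents  # same bookkeeping Operation.__init__ does

    def forward(self, x) -> float:
        # x is a Tensor in the engine; only x.data is used here
        return 1.0 / (1.0 + math.exp(-x.data))

    def backward(self, upstream_grad: float) -> Tuple[float]:
        x = self.parents[0]
        s = 1.0 / (1.0 + math.exp(-x.data))
        df_dx = s * (1.0 - s)  # the classic sigmoid derivative
        return (upstream_grad * df_dx,)
```

At \(x = 0\) this gives a forward value of \(0.5\) and a local gradient of \(0.25\), which is a quick way to sanity-check the implementation.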
Perfect, our simple autodiff engine is ready! Now it is time to test it out, isn’t this very exciting? Yes it is.
We’ll be solving a classification problem of points in a 2D space with two labels, using Logistic Regression 🦕. We’ll follow the common ML problem flow:
Dataset: In this case we are going to synthetically generate some clusters and assign the two possible labels to them.
Model: We’ll go for the old trusty logistic regression: \(f(x) = \sigma(w_1 \cdot x_1 + w_2 \cdot x_2 + b)\). Aka a 1-layer ANN with a sigmoid activation function.
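The LogisticRegressionModel and BCELoss implementations aren’t shown in this excerpt; as a plain-float sketch of the math they compute (no autodiff here, and the function names are mine, not the post’s):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def logistic_forward(x_1: float, x_2: float, w_1: float, w_2: float, b: float) -> float:
    # f(x) = sigma(w_1 * x_1 + w_2 * x_2 + b)
    return sigmoid(w_1 * x_1 + w_2 * x_2 + b)

def bce_loss(y_pred: float, y_true: float) -> float:
    # binary cross-entropy: -[y * log(p) + (1 - y) * log(1 - p)]
    return -(y_true * math.log(y_pred) + (1.0 - y_true) * math.log(1.0 - y_pred))
```

In the real engine these same expressions are built out of Tensor operations, so that the loss can call backward() through them.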
```python
class GradientDecentOptimizer:
    def __init__(self, parameters: list, lr: float = 0.01):
        self.parameters = parameters
        self.lr = lr

    def step(self):
        for param in self.parameters:
            param.data = param.data - self.lr * param.grad
```
Trainer
We instantiate the model, loss and optimizer and perform the training loop.
```python
model = LogisticRegressionModel(w_1=-1.0, w_2=1.0, b=2.0)  # Some random initial values
loss_fn = BCELoss()
optimizer = GradientDecentOptimizer(model.params, lr=0.1)

for epoch in range(1):
    print("\nEPOCH:", epoch)
    for n_ex, example in enumerate(dataset):
        model.zero_grad()
        x_1 = Tensor(data=example["x_1"], requires_grad=True)
        x_2 = Tensor(data=example["x_2"], requires_grad=True)
        y_pred = model.forward(x_1, x_2)
        y_true = example["y_true"]
        loss = loss_fn.forward(y_pred, y_true); loss.name = "loss"; loss.role = "output"
        loss.backward()
        optimizer.step()
```
If we run this, we can see that the logistic regression indeed learns to model our data 😀:
Decision boundary evolution during learning. Looking cool AF 😎 (this should be a gif that moves, not sure if it’ll work)
I also thought it’d be interesting to visualize the overall computational graph. Here we can see the stored values after a forward and backward pass:
Computational graph of the model and loss at the last training step. Color coding goes like this: 🟠 model learnable parameters, 🟢 input variables, 🔵 intermediate variables, 🟣 outputs.
Cool stuff, cool stuff. Anyway, I think that’s gonna be a wrap for today. Truth is I wanted to refresh my memory on backpropagation and some of the basics and got a bit carried away (as it happens). Next steps (which I’d love to do but I doubt I find the time) include implementing this for multi-dimensional tensors. Let’s see if one day I get fired up and end up writing it 🔥.