I go through many core topics of ANN optimization: Gradient descent, the chain rule, backpropagation and computational graphs.
ML Basics
Author
Oleguer Canal
Published
December 28, 2025
Today we will be looking at a very fundamental1 topic in deep learning: how gradients are computed in a library such as PyTorch. Roughly, I wanna cover this stuff:
1 Because you gotta be mental not to enjoy this.
From analytical optimization to gradient descent.
From function optimization to model fitting through Maximum Likelihood Estimation.
The chain rule and backpropagation
Building a tiny autodiff engine to make sure we understand everything2. We will use it to train a logistic regressor for a toy classification problem.
2 I just saw that Karpathy did this 5 years ago 😅 Here is the source if you’re interested in how to do this better.
Introduction
Aaah optimization 😌 What isn’t there to love about making everything bigger, or smaller, or faster, or slower, or just… better.
Given:
\[
f: \mathbb{R}^n \rightarrow \mathbb{R}
\]
We would die for:
\[
\min_x \space f(x)
\]
Tip 2: The high-school way
In high school we’d be given a function such as:
\[
f(x) = 2 x^2 + 3 x + 5
\]
And asked to find the \(x\) at which it attains its minimum. We’d solve this analytically by computing the derivative:
\[
\frac{\partial f(x)}{\partial x} = 4 x + 3
\]
And solve for \(x\) such that \(\frac{\partial f}{\partial x} (x) = 0\).
In this case:
\[
4 x + 3 = 0
\]
\[
x = -\frac{3}{4}
\]
Visualized:
Code
```python
import plotly.express as px
import numpy as np

x = np.arange(-2, 2, 0.1)
y = 2 * x * x + 3 * x + 5
y_prime = 4 * x + 3  # Corrected derivative calculation

# Combine both lines into one figure
fig = px.line()
fig.add_scatter(x=x, y=y, mode='lines', name='f(x): 2x^2 + 3x + 5', line=dict(color='blue'))
fig.add_scatter(x=x, y=y_prime, mode='lines', name="f'(x): 4x + 3", line=dict(color='red'))

# Find x where derivative crosses zero
x_zero = -3 / 4

# Update layout for better visualization
fig.update_layout(
    width=500, height=400,
    xaxis=dict(title='x', zeroline=True, zerolinewidth=2, zerolinecolor='black'),
    yaxis=dict(title='y', zeroline=True, zerolinewidth=2, zerolinecolor='black'),
    # legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01)
)

# Add a vertical line where the derivative crosses zero
y_at_minimum = 2 * x_zero * x_zero + 3 * x_zero + 5
fig.add_shape(
    type="line",
    x0=x_zero, y0=0, x1=x_zero, y1=y_at_minimum,
    line=dict(color="green", width=2, dash="dashdot"),
)

fig.show()
```
Life as an adult isn’t as beautiful though: real-life functions can be too complex to obtain the analytical form of their derivatives over their whole domain4. Still, if we can at least obtain the gradient locally, we can find the minimum using other methods. Let’s give a round of applause to the gradient descent algorithm.
4 Just consider the hotdog-or-not function: a function that tells you whether a picture is a hot dog or not-a-hot-dog 🌭. It is extremely convoluted (pun intended) .
Gradient descent
Gradient descent is an iterative method to optimize a differentiable function5. The main observation is the following:
5 In deep learning (in practice) differentiability in a strict sense is not taken super seriously.
If we take any point \(a\) from \(f\)’s domain, and we move “slightly” in the opposite direction of \(f\)’s gradient at \(a\), \(f\) will decrease:
\[
f(a - \eta \cdot \nabla f(a)) < f(a)
\]
Where \(\eta\) is a small scalar. We can iteratively apply this idea to move a random initial guess towards a local optimum6:
6 Obviously, there is a lot of stuff to comment about global vs local optima and convergence guarantees and their mother. I won’t cover it here to limit a bit the scope of this.
\[
\begin{aligned}
&\text{Randomly initialize } x_0 \\
&\text{Repeat until convergence:} \\
&\hspace{1em} \text{Find the next point:} \ x_{k+1} = x_k - \eta \cdot \nabla_x f (x_k) \\
\end{aligned}
\]
The main advantage here is that we don’t need to compute the analytical form of the gradient: We only need it locally for every point we evaluate. 🥳
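The loop above can be sketched in a few lines of plain Python. This is a minimal, standalone sketch (function names and the fixed step count are my choices), reusing the high-school derivative \(f'(x) = 4x + 3\) as the gradient oracle:

```python
# Minimal gradient descent: all we need is a callable that returns
# the gradient of f at any point we ask about.
def gradient_descent(grad_f, x0: float, lr: float = 0.4, steps: int = 100) -> float:
    x = x0
    for _ in range(steps):
        x = x - lr * grad_f(x)  # move slightly against the gradient
    return x

# Our high-school example: f(x) = 2x^2 + 3x + 5, so f'(x) = 4x + 3
x_star = gradient_descent(lambda x: 4 * x + 3, x0=0.0)
print(x_star)  # converges towards -3/4
```

Note that we never needed the closed-form solution \(x^* = -\frac{3}{4}\); only local gradient evaluations.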
Note 3: Why does the gradient point in the direction of steepest ascent?
This question popped into my head when writing this. It is not so obvious, right? It’s actually not that hard, so let’s get into it. Consider the following function:
\[
f : \mathbb{R}^n \rightarrow \mathbb{R}
\]
If \(\vec{h}\) is small enough, we have that the first-order Taylor expansion of \(f\) around \(\vec{a}\) is:
\[
f(\vec{a} + \vec{h}) \approx f(\vec{a}) + \nabla f(\vec{a}) \cdot \vec{h}
\]
This is a bit hand-wavy 👋, but the reasoning goes like this:
\(\nabla f(\vec{a}) \cdot \vec{h}\) is maximized when \(\vec{h}\) is in the direction of \(\nabla f(\vec{a})\) (Cauchy-Schwarz inequality).
If we want to maximize \(f(\vec{a} + \vec{h})\), \(\vec{h}\) needs to be in the direction of \(\nabla f(\vec{a})\).
Therefore, the direction of maximum ascent of \(f\) from \(a\) is the direction of \(\nabla f(\vec{a})\).
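We can sanity-check this numerically: for a toy \(f\) and a small step \(h\), no random unit direction should beat the gradient direction. A sketch (the particular \(f\), seed, and tolerance are arbitrary choices of mine):

```python
import numpy as np

def f(p):
    # toy function f(x, y) = x^2 + 3y, with gradient (2x, 3)
    return p[0] ** 2 + 3 * p[1]

a = np.array([1.0, 1.0])
grad = np.array([2 * a[0], 3.0])
grad_dir = grad / np.linalg.norm(grad)
h = 1e-4  # "small enough" step

# increase of f along the gradient direction
grad_inc = f(a + h * grad_dir) - f(a)

# best increase over many random unit directions
rng = np.random.default_rng(0)
best = -np.inf
for _ in range(1000):
    d = rng.normal(size=2)
    d /= np.linalg.norm(d)
    best = max(best, f(a + h * d) - f(a))
```

After running this, `grad_inc` matches or exceeds `best` (up to second-order terms of size \(O(h^2)\)), which is exactly the Cauchy-Schwarz argument in numbers.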
Tip 4: Gradient descent on your high-school problem
Ok, let’s see how we would apply gradient descent to our previous problem (Tip 2).
We would start with a random guess of the optimal point, \(x_0\). Let’s pick \(x_0 = 0\) for simplicity, set the learning rate to \(\eta = 0.4\), and run it for \(5\) iterations.
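Running the update rule by hand (same \(f'(x) = 4x + 3\) as before):

```python
def f_prime(x: float) -> float:
    return 4 * x + 3

eta = 0.4
x = 0.0  # x_0
history = [x]
for _ in range(5):
    x = x - eta * f_prime(x)  # x_{k+1} = x_k - eta * f'(x_k)
    history.append(x)

print(history)  # x_0 .. x_5, ending near -0.80832
```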
In 5 steps we got to the point \(x_5 = -0.80832\), which is much closer to the optimal point \(x^* = -\frac{3}{4} = -0.75\). Visualized:
Code
```python
import plotly.express as px
import plotly.graph_objects as go
import numpy as np

# Define functions
def f(x):
    return 2 * x * x + 3 * x + 5

def f_prime(x):
    return 4 * x + 3

# Perform gradient descent
x_0 = 0
x_1 = x_0 - 0.4 * f_prime(x_0)
x_2 = x_1 - 0.4 * f_prime(x_1)
x_3 = x_2 - 0.4 * f_prime(x_2)
x_4 = x_3 - 0.4 * f_prime(x_3)
x_5 = x_4 - 0.4 * f_prime(x_4)

x = np.arange(-2, 2, 0.1)
y = f(x)
y_prime = f_prime(x)

fig = px.line(x=x, y=y)
fig.add_scatter(x=x, y=y_prime, mode='lines', showlegend=False)

# Add labels to each point showing which is which (x_0, x_1, ..., x_5)
x_vals = [x_0, x_1, x_2, x_3, x_4, x_5]
y_vals = [f(x_0), f(x_1), f(x_2), f(x_3), f(x_4), f(x_5)]
labels = [str(i) for i in range(6)]
fig.add_scatter(x=x_vals, y=y_vals, mode='markers+text', text=labels,
                textposition='top right', showlegend=False)

# Add arrows indicating gradient direction
for i in range(len(x_vals) - 1):
    fig.add_annotation(
        x=(x_vals[i + 1] + x_vals[i]) / 2, y=y_vals[i],
        ax=x_vals[i], ay=y_vals[i],
        xref='x', yref='y', axref='x', ayref='y',
        arrowhead=1, arrowsize=1, arrowwidth=1, arrowcolor='red',
        showarrow=True,
    )

fig.show()
```
Model optimization
Finding the minimum of a function is cool and all, but that’s not what we came for, is it? Things get a bit more meta when we start talking about fitting models.
More specifically, we are usually given a dataset7:
\[
\mathcal{D} = \{ (x_i, y_i) \}_{i=1}^{N}
\]
7 We’ll focus on parametric discriminative supervised learning
And we are asked to model this data8. For this, we assume some functional form9 depending on some parameters \(\theta\) such as:
8 Aka learn the mapping between \(x_i\) and \(y_i\)
9 Notice this is a source of bias! We are choosing a function based on our “expertise” which could lead into a bad model. Common errors are choosing an under-parametrized function (causing under-fitting), choosing an over-parametrized or high-variance-susceptible function (potentially causing over-fitting), or simply choosing a function unable of representing \(\mathcal{D}\).
\[
f_{\theta} (x) = \text{whatever you want}
\]
And also choose some loss function \(\mathcal{L} : \mathbb{R}^{|\theta | } \rightarrow \mathbb{R}\):
\[
\mathcal{L} (f_{\theta}, \mathcal{D}) = \text{also whatever you want}
\]
It is important to notice that we are no longer optimizing wrt the function input variable \(x\): we are optimizing wrt the parameters \(\theta\). This is our new variable (all the \(x\)’s are actually fixed, given by the dataset). Likewise, we are no longer finding a minimum of \(f\), but a minimum of our loss \(\mathcal{L}\)11. To make it more similar to the previous notation, we can write it like so:
\[
\min_\theta \space \mathcal{L} (f_{\theta}, \mathcal{D})
\]
11 Which represents a notion of “how bad \(f\) is at modeling the data”
Ok, amazing! But… how does gradient descent fit into this, you ask? Same as before, I answer! But searching within the parameter-space instead of the input-space. Given some learning rate \(\eta\):
\[
\begin{aligned}
&\text{Randomly initialize } \theta_0 \\
&\text{Repeat until convergence:} \\
&\hspace{1em} \text{Find the next point:} \ \theta_{k+1} = \theta_k - \eta \cdot \nabla_\theta \mathcal{L} (f_{\theta_k}, \mathcal{D}) \\
\end{aligned}
\]
Ok, not so bad. But wait… computing all parameter gradients \(\nabla_\theta \mathcal{L}(\hat{y}_k, y)\) might not be so straightforward, right? After all, \(f\) can be extremely complex.
Writing an autograd engine
Before we get into the thick of it, let’s define a few things.
A computational graph is a directed acyclic graph (DAG) that encodes how a computation is built from primitive operations. This graph has two types of nodes:
Data nodes: nodes that represent the data to be operated on.
Operation nodes: nodes that represent the operations to be performed.
Figure 1: Computational graph of the function \(y = w \cdot x + b\). Here I represented data nodes with 🟩 and operation nodes as 🔵.
Crucially, each data node has a pointer to the operation that created it (unless it is a “leaf node” 🍂). Similarly, each operation node has pointers to the data nodes that were fed into it.
The use of this graph is twofold:
Forward pass: We sequentially compute the values of the data nodes in the graph (from left to right \(\rightarrow\) in the picture), applying the operations described by the operation nodes.
Backward pass: We backpropagate the gradients (from right to left \(\leftarrow\)) through the graph, computing the gradient of the loss with respect to each data node. For this we make use of the chain rule.
Note 5: The chain rule reminder & backpropagation
Turns out that the derivative of a composition of functions is the product of the derivatives of the functions:
\[
\frac{\partial \left[ g \circ f \right] }{\partial x} (x) =
\frac{\partial g }{\partial f} \left( f(x) \right)
\cdot
\frac{\partial f }{\partial x} (x)
\]
In the computational graph of Figure 1, we could write:
\[
\frac{\partial y}{\partial w} = \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w}
\]
\(\frac{\partial y}{\partial z}\) tells us how much the output \(y\) would change if we changed \(z\) by a unit (an infinitesimal unit). Similarly, \(\frac{\partial z}{\partial w}\) tells us how much \(z\) would change if we changed \(w\) by a unit. Thus, the product of these two tells us how much \(y\) would change if we changed \(w\) by a unit.
Putting some random numbers:
If, for instance, \(\frac{\partial y}{\partial z} = 3\), this means that if \(z\) had been a unit larger, \(y\) would have been \(3\) units larger.
In parallel, if \(\frac{\partial z}{\partial w} = -2\), this means that if \(w\) had been a unit larger, \(z\) would have been \(2\) units smaller.
Thus, if \(w\) had been a unit larger, \(y\) would have been \(3 \cdot (-2) = -6\) units larger, i.e. \(6\) units smaller.
If we were aiming to minimize \(y\), we would want to update \(w\) by making it larger. In addition, we know that the magnitude of \(w\)’s effect on \(y\) is \(6\), so our optimizer will scale the update by this factor.
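We can check the toy numbers above with a finite difference. The two linear functions are chosen (by me) so that \(\frac{\partial y}{\partial z} = 3\) and \(\frac{\partial z}{\partial w} = -2\) exactly:

```python
def z_of(w: float) -> float:
    return -2.0 * w   # so dz/dw = -2

def y_of(z: float) -> float:
    return 3.0 * z    # so dy/dz = 3

w, h = 1.0, 1e-6
# finite-difference estimate of dy/dw through the composition y(z(w))
dy_dw = (y_of(z_of(w + h)) - y_of(z_of(w))) / h
print(dy_dw)  # ≈ 3 * (-2) = -6, as the chain rule predicts
```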
Note 6: Adversarial attacks and deep dreams
Notice that we could obtain the gradient of the loss wrt the input, and perturb the input in that direction:
\[
x^\prime = x + \epsilon \cdot \nabla_x \mathcal{L}
\]
Where \(\epsilon\) is a small scalar. This small perturbation of the input is in the direction which maximizes the loss. I.e. we are changing the input such that the model is more likely to output something wrong.
Deep dreams
Alternatively, we can be extra funny and modify the input such that the loss is minimized, like so:
\[
x^\prime = x - \epsilon \cdot \nabla_x \mathcal{L}
\]
Usually we do this multiple times to ensure we see something interesting. Disclaimer: I oversimplified this a lot just to grasp the general idea. In practice, when doing this, one chooses some internal feature to maximize (instead of the target loss). I go into much more detail in my post on Deep dreams.
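A toy numeric illustration of both tricks, using a simple quadratic stand-in for the loss (not a real model; the function and \(\epsilon\) are arbitrary choices of mine):

```python
def loss(x: float) -> float:
    return (x - 2.0) ** 2      # pretend "loss", minimized at x = 2

def grad(x: float) -> float:
    return 2.0 * (x - 2.0)     # analytic gradient of the loss

x, eps = 0.5, 0.1
x_adv = x + eps * grad(x)      # "adversarial": step up the loss
x_dream = x - eps * grad(x)    # "deep dream": step down the loss
```

After one step, `loss(x_adv)` is larger than `loss(x)` and `loss(x_dream)` is smaller, which is the whole idea in miniature.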
Ok, with this out of the way, let’s get to work. The plan is more or less the following:
We will create a Tensor class, which will be the data nodes of our computational graph.
We will create the Operation class, which will be the operation nodes of our computational graph. Then we’ll implement a bunch of operations like multiplication, addition, logarithm…
We test the engine by training a small model on some toy example we design.
The Tensor class
First things first, let’s define the data that our “data node” holds.
```python
class Tensor:
    def __init__(self, data: float, requires_grad: bool = False) -> None:
        self.data = data                    # the actual data (only python float supported in this toy example)
        self.requires_grad = requires_grad  # whether we need to compute gradients for this node
        self.grad = None                    # gradient in that node
        self.grad_fn = None                 # pointer to the operation that produced this tensor
```
Piece of 🍰. Let’s now add the backprop function. We will split it into 2 functions:
_get_topological_sorted_graph: When called on tensor \(t\), this gives us a flattened list of all the tensors that were involved in the computation of \(t\). The result is given in topological order which means that for each directed edge \(e: u \mapsto v\), \(u\) will appear before \(v\) in the resulting list12.
_backprop_grads: This function traverses the previous graph in reverse order (from newest to oldest tensor), passing the gradients from one tensor to the next. The backpropagated gradients get modified each time they traverse an “operation node”, according to the nature of that operation.
12 Aka “things are in order of appearance”
For now we will assume that the Operation class is implemented, that we have a function that tells us how that operation alters the incoming gradients, and that we have access to the Tensors that were inputs to that operation instance (aka its parents).
```python
from typing import List, Set


class Tensor:
    def __init__(self, data: float, requires_grad: bool = False) -> None:
        self.data = data
        self.requires_grad = requires_grad
        self.grad = None
        self.grad_fn = None

    def _get_topological_sorted_graph(self) -> List["Tensor"]:
        """
        Return list of tensors in the computational graph up until self.
        The list is sorted in topological order
        (aka from left to right, aka from oldest to newest)
        """
        topo: List[Tensor] = []
        visited: Set[int] = set()

        def build_topo(t: Tensor) -> None:
            tid = id(t)
            if tid in visited:
                return
            visited.add(tid)
            if t.grad_fn is None:
                return
            for p in t.grad_fn.parents:
                build_topo(p)
            topo.append(t)

        build_topo(self)
        return topo

    def _backprop_grads(self, topo: List["Tensor"]) -> None:
        """
        Walk graph in reverse topological order
        i.e. from right to left, pushing gradients backwards
        """
        for t in reversed(topo):
            if t.grad_fn is None:
                continue  # leaf tensor; nothing to propagate further
            if t.grad is None:
                continue  # no upstream flow reached this node
            # Local backward. Gradients of: ∂ operation / ∂ inputs
            parent_grads = t.grad_fn.backward(t.grad)
            for parent, local_grad in zip(t.grad_fn.parents, parent_grads):
                if local_grad is None:
                    continue
                if parent.requires_grad:
                    parent.grad = (parent.grad or 0.0) + float(local_grad)

    def backward(self, grad: float = 1.0) -> None:
        if not self.requires_grad:
            return
        # Get sorted list of tensors
        topo = self._get_topological_sorted_graph()
        # Seed gradient on the output
        self.grad = (self.grad or 0.0) + float(grad)
        # Backprop gradients
        self._backprop_grads(topo)
```
The Operation class
Amazing! Let’s now write an abstract class for a generic operation. The forward and backward methods will need to be implemented by each operation type:
forward method: performs the operation on the input tensors.
backward method: defines how the upstream gradient gets distributed to each input.
We’ll also create a class method for convenience which instantiates the operation, runs its forward pass, and wires the resulting output tensor into the graph:
```python
import abc


class Operation(abc.ABC):
    def __init__(self, *parents):
        self.parents = parents  # pointers to the tensors that are the inputs to this operation

    @abc.abstractmethod
    def forward(self, *inputs):
        # What the operation does to the inputs
        pass

    @abc.abstractmethod
    def backward(self, *grads):
        # What the operation does to the gradients
        pass

    @classmethod
    def __call__(cls, *parents):
        operation = cls(*parents)                              # instantiate the operation node
        out_data = operation.forward(*parents)                 # compute the output (forward reads .data itself)
        requires_grad = any(p.requires_grad for p in parents)  # check if output requires grad
        out = Tensor(out_data, requires_grad=requires_grad)    # create the output tensor
        if requires_grad:
            out.grad_fn = operation  # store the operation that produced this tensor
        return out
```
Let’s implement some operations!
Given the abstract class, we are now ready to implement operations like churros! For now let’s get some of the basics: Multiplication, Addition, Logarithm and Sigmoid.
```python
class Multiplication(Operation):
    """
    z = a * b
    dz/da = b
    dz/db = a
    """
    def forward(self, a: Tensor, b: Tensor) -> float:
        return a.data * b.data

    def backward(self, upstream_grad: float) -> Tuple[float, float]:
        a, b = self.parents
        dz_da = b.data * upstream_grad
        dz_db = a.data * upstream_grad
        return (dz_da, dz_db)
```
```python
class Addition(Operation):
    """
    z = a + b
    dz/da = 1
    dz/db = 1
    """
    def forward(self, a: Tensor, b: Tensor) -> float:
        return a.data + b.data

    def backward(self, upstream_grad: float) -> Tuple[float, float]:
        a, b = self.parents
        dz_da = 1. * upstream_grad
        dz_db = 1. * upstream_grad
        return (dz_da, dz_db)
```
```python
class Log(Operation):
    """
    f(x) = log(x)
    df/dx = 1 / x
    """
    def forward(self, x: Tensor) -> float:
        return math.log(x.data)

    def backward(self, upstream_grad: float) -> Tuple[float]:
        x = self.parents[0]
        df_dx = 1.0 / x.data
        return (upstream_grad * df_dx,)
```
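The list above also promised a Sigmoid. Following the same pattern, it could look like this (shown standalone so it runs on its own; in the engine it would subclass Operation, whose `__init__` just stores the parents):

```python
import math
from typing import Tuple


class Sigmoid:  # would be `class Sigmoid(Operation):` in the engine above
    """
    f(x) = 1 / (1 + exp(-x))
    df/dx = f(x) * (1 - f(x))
    """
    def __init__(self, *parents):
        self.parents = parents  # same bookkeeping Operation.__init__ does

    def forward(self, x) -> float:
        # x is a Tensor in the engine; only x.data is used here
        return 1.0 / (1.0 + math.exp(-x.data))

    def backward(self, upstream_grad: float) -> Tuple[float]:
        x = self.parents[0]
        s = 1.0 / (1.0 + math.exp(-x.data))
        df_dx = s * (1.0 - s)  # the classic sigmoid derivative
        return (upstream_grad * df_dx,)
```

At \(x = 0\) this gives a forward value of \(0.5\) and a local gradient of \(0.25\), which is a quick way to sanity-check the implementation.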
Perfect, our simple autodiff engine is ready! Now it is time to test it out, isn’t this very exciting? Yes it is.
We’ll be solving a classification problem of points in a 2D space with two labels, using Logistic Regression 🦕. We’ll follow the common ML problem flow:
Dataset: In this case we are going to synthetically generate some clusters and assign the two possible labels to them.
Model: We’ll go for the old trusty logistic regression: \(f(x) = \sigma(w_1 \cdot x_1 + w_2 \cdot x_2 + b)\). Aka a 1-layer ANN with a sigmoid activation function.
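The LogisticRegressionModel and BCELoss implementations aren’t shown in this excerpt; as a plain-float sketch of the math they compute (no autodiff here, and the function names are mine, not the post’s):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def logistic_forward(x_1: float, x_2: float, w_1: float, w_2: float, b: float) -> float:
    # f(x) = sigma(w_1 * x_1 + w_2 * x_2 + b)
    return sigmoid(w_1 * x_1 + w_2 * x_2 + b)

def bce_loss(y_pred: float, y_true: float) -> float:
    # binary cross-entropy: -[y * log(p) + (1 - y) * log(1 - p)]
    return -(y_true * math.log(y_pred) + (1.0 - y_true) * math.log(1.0 - y_pred))
```

In the real engine these same expressions are built out of Tensor operations, so that the loss can call backward() through them.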
```python
class GradientDecentOptimizer:
    def __init__(self, parameters: list, lr: float = 0.01):
        self.parameters = parameters
        self.lr = lr

    def step(self):
        for param in self.parameters:
            param.data = param.data - self.lr * param.grad
```
Trainer
We instantiate the model, loss and optimizer and perform the training loop.
```python
model = LogisticRegressionModel(w_1=-1.0, w_2=1.0, b=2.0)  # Some random initial values
loss_fn = BCELoss()
optimizer = GradientDecentOptimizer(model.params, lr=0.1)

for epoch in range(1):
    print("\nEPOCH:", epoch)
    for n_ex, example in enumerate(dataset):
        model.zero_grad()
        x_1 = Tensor(data=example["x_1"], requires_grad=True)
        x_2 = Tensor(data=example["x_2"], requires_grad=True)
        y_pred = model.forward(x_1, x_2)
        y_true = example["y_true"]
        loss = loss_fn.forward(y_pred, y_true); loss.name = "loss"; loss.role = "output"
        loss.backward()
        optimizer.step()
```
If we run this, we can see that the logistic regression indeed learns to model our data 😀:
Decision boundary evolution during learning. Looking cool AF 😎 (this should be a gif that moves, not sure if it’ll work)
I also thought it’d be interesting to visualize the overall computational graph. Here we can see the stored values after a forward and backward pass:
Computational graph of the model and loss at the last training step. Color coding goes like this: 🟠 model learnable parameters, 🟢 input variables, 🔵 intermediate variables, 🟣 outputs.
Cool stuff, cool stuff. Anyway, I think that’s gonna be a wrap for today. Truth is I wanted to refresh my memory on backpropagation and some of the basics and got a bit carried away (as it happens). Next steps (which I’d love to do but I doubt I find the time) include implementing this for multi-dimensional tensors. Let’s see if one day I get fired up and end up writing it 🔥.