This is a very shallow post I use as a reference from other posts, just to make sure we are on the same page regarding notation and concepts around Supervised Learning.
How are models “trained”?
Imagine we are given a dataset of input-output pairs1:
1 Here \(x\) and \(y\) can be anything: image \(\leftrightarrow\) tag, audio \(\leftrightarrow\) transcription, tokens \(\leftrightarrow\) next_token, …
\[ \mathcal{D} = \{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\} \]
We are asked (quite unsurprisingly) to find a decent mapping \(x \rightarrow y\). We do so by first assuming a functional form2 \(f\) of the mapping dependent on some parameters \(W\)3:
2 Fancy way of saying “a sequence of operations”
3 The more the merrier 🤪
\[ y = f(x; W) \]
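As a tiny sketch (a hypothetical linear model in plain NumPy, not tied to any particular task), such a functional form could look like:

```python
import numpy as np

def f(x, W, b):
    """A simple linear functional form: y = W x + b."""
    return W @ x + b

# Hypothetical parameters for a 1-D input and output
W = np.array([[2.0]])
b = np.array([1.0])

y = f(np.array([3.0]), W, b)  # 2 * 3 + 1 = 7
```

Here `W` and `b` play the role of the parameters we will later adjust.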
To find the best parameters according to the data, we define an error metric (loss function) which we want to optimize. If working on regression, it could be MSE (mean squared error); if working on classification, it could be cross-entropy, or any other suitable metric for that matter:
\[ \mathcal{L} (f, \mathcal{D}, W) \]
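For instance, a minimal MSE implementation in plain NumPy (illustrative, not this post's code):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.5, 2.0])
loss = mse(y_true, y_pred)  # (0 + 0.25 + 1) / 3
```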
Now we need to find a global optimum of \(\mathcal{L}\) with respect to \(W\). This translates into finding a zero of its gradient. Usually however4, the analytical expression of the gradient is too complex to derive. Luckily, we can approximate it numerically at some points (provided some input-output pairs).
4 Unless very carefully choosing \(f\), which often makes its modelling power very limited.
We usually start with “random” parameters \(W\) and we do some variation of gradient descent to find the ones which minimize the loss. The idea is that we iteratively take steps towards a local optimum until convergence:
\[ W^{k+1} \leftarrow W^{k} - \eta \cdot \nabla_W \mathcal{L} \left( f, \mathcal{D}, W^k \right) \]
Where \(\eta\) is the learning rate, which tells us how big of a step we take.
The gradient \(\nabla_x f(x) = \left( \frac{\partial f}{\partial x_1}, ..., \frac{\partial f}{\partial x_n} \right)\) is the generalization to multi-dimensional inputs of the derivative \(\frac{\partial f(x)}{\partial x}\). The direction of the vector indicates the direction of maximum ascent of \(f\) from a given point.
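The update rule above can be sketched end-to-end on a toy one-parameter regression (my example, with the gradient derived by hand rather than approximated by autodiff):

```python
import numpy as np

# Toy data generated by y = 3x, so the optimal parameter is w = 3
xs = np.array([1.0, 2.0, 3.0])
ys = 3.0 * xs

w = 0.0      # "random" initial parameter
eta = 0.05   # learning rate

for _ in range(200):
    preds = w * xs
    # Gradient of MSE wrt w: 2 * mean((w*x - y) * x)
    grad = 2 * np.mean((preds - ys) * xs)
    # Gradient descent step
    w = w - eta * grad

# w converges towards 3
```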
What I explained previously falls within the MLE paradigm in probabilistic modelling.
Before understanding the differences it is important to keep in mind the Bayes theorem:
\[ p \left(Y \mid X \right) = \frac{p \left(X \mid Y \right) \space p \left(Y \right)}{p \left(X \right)} \]
Naming goes:
- \(p \left(Y \mid X \right)\) posterior
- \(p \left(X \mid Y \right)\) likelihood
- \(p \left(Y \right)\) prior
- \(p \left(X \right)\) evidence
Differences:
- MLE (Maximum Likelihood Estimation) finds the parameters which best explain the data, i.e. the ones that give the highest likelihood:
\[ \theta^\star = \arg \max_\theta \mathcal{L} \left( \mathcal{D}, \theta \right) \]
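As a classic worked example (mine, not the post's): estimating the heads probability of a coin from observed flips by maximizing the Bernoulli log-likelihood over a grid recovers the empirical frequency \(k/n\):

```python
import numpy as np

# 7 heads out of 10 flips; the MLE of the heads probability is k/n = 0.7
flips = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

def log_likelihood(theta, data):
    """Bernoulli log-likelihood of the observed flips under parameter theta."""
    return np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

# Grid search over candidate parameters
thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmax([log_likelihood(t, flips) for t in thetas])]
# best is (approximately) 0.7
```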
- MAP (Maximum a Posteriori) works within the Bayesian framework and, instead of maximizing the likelihood, maximizes the posterior. It finds the most probable parameters given the data. For this, we need to assume some prior distribution over the parameters \(p(\theta)\).
\[ \theta^* = \arg\max_{\theta} p(\theta \mid D) = \arg\max_{\theta} \frac{p(D \mid \theta)p(\theta)}{p(D)} = \arg\max_{\theta} p(D \mid \theta)p(\theta) \]
Notice that the probability of the observed data is independent of the model parameters. Thus, we do not need to consider it for the MAP computation.
In addition, this can be linked to some regularization techniques, depending on the prior chosen.
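For instance, assuming a zero-mean Gaussian prior \(p(\theta) \propto \exp\left(-\lambda \lVert \theta \rVert^2\right)\), taking logarithms shows that MAP estimation amounts to MLE plus an L2 penalty (i.e. weight decay):

\[ \theta^\star = \arg\max_\theta \left[ \log p(\mathcal{D} \mid \theta) + \log p(\theta) \right] = \arg\max_\theta \left[ \log p(\mathcal{D} \mid \theta) - \lambda \lVert \theta \rVert^2 \right] \]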
PyTorch implementation
The previous logic is abstracted by deep learning libraries like so:
for inputs, labels in training_loader:
    # Zero the gradients for every batch
    optimizer.zero_grad()
    # Make predictions for this batch
    outputs = model(inputs)
    # Compute the loss
    loss = loss_fn(outputs, labels)
    # Compute each parameter's gradient wrt the loss
    loss.backward()
    # Update the parameters using the gradients
    optimizer.step()