Deep Supervised Learning, oversimplified

Assume a functional form of the mapping dependent on some parameters, find the best parameters.
Author

Oleguer Canal

Published

April 30, 2024

Warning

This is a very shallow post that I use as a reference in other posts, just to make sure we are on the same page regarding the understanding and notation around Supervised Learning.

How are models “trained”?

Imagine we are given a dataset of input-output pairs¹:

¹ Here $x$ and $y$ can be anything: image → tag, audio → transcription, tokens → next_token, …

$$D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$$
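For concreteness, here is a minimal sketch of what such a dataset could look like in code (a hypothetical 1-D regression problem with made-up noisy linear data; the names and numbers are purely illustrative):

import numpy as np

# Hypothetical toy dataset: y is roughly 3*x + 2 plus some noise
rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=100)
ys = 3.0 * xs + 2.0 + rng.normal(scale=0.1, size=100)

D = list(zip(xs, ys))  # [(x1, y1), (x2, y2), ..., (xn, yn)]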

We are asked (quite unsurprisingly) to find a decent mapping $x \rightarrow y$. We do so by first assuming a functional form² $f$ of the mapping, dependent on some parameters $W$³:

² Fancy way of saying “a sequence of operations”

³ The more the merrier 🤪

$$y = f(x; W)$$
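As a toy illustration (continuing the hypothetical dataset above), $f$ could be as simple as a line parameterized by a slope and an intercept; in deep learning, $f$ is a neural network with many more parameters, but the idea is the same:

def f(x, W):
    # A very simple functional form: a line with parameters W = (slope, intercept)
    slope, intercept = W
    return slope * x + intercept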

To find the best parameters according to the data, we define an error metric (loss function) which we want to optimize. If working on regression, it could be the MSE; if working on classification, the cross-entropy, or any other metric for that matter:

$$L(f, D, W)$$
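For the toy regression example above, a minimal MSE loss could look like this (again, just a sketch):

def mse_loss(f, D, W):
    # Mean squared error of model f with parameters W over dataset D
    return sum((f(x, W) - y) ** 2 for x, y in D) / len(D)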

Now we need to find a global optimum of $L$ w.r.t. $W$. This translates into finding a zero of its gradient $\nabla_W L$ (see the tip below). Usually, however⁴, the analytical expression of $\nabla_W L$ is too complex to derive. Luckily, we can approximate it numerically at some points (provided some input-output pairs).

⁴ Unless $f$ is chosen very carefully, which often makes its modelling power very limited.

We usually start with “random” parameters $W$ and do some variation of gradient descent to find the ones which minimize the loss. The idea is to iteratively take steps towards a local optimum until convergence:

$$W_{k+1} \leftarrow W_k - \eta \, \nabla_W L(f, D, W_k)$$

Where $\eta$ is the learning rate, which tells us how big of a step we take.

Tip

The gradient $\nabla_x f(x) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)$ is the generalization to multi-dimensional inputs of the derivative $\frac{\partial f(x)}{\partial x}$. The vector points in the direction of maximum ascent of $f$ from a given point.
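To make the update rule concrete, here is a bare-bones sketch of gradient descent on the toy linear model from the snippets above (reusing the hypothetical `f` and `D` defined there). The gradient of the MSE is written out by hand for this tiny model, whereas deep learning libraries compute it for us via automatic differentiation:

# Minimal gradient descent sketch on the toy linear model defined above
W = [0.0, 0.0]  # "random" initial parameters (slope, intercept)
eta = 0.1       # learning rate

for step in range(1000):
    # Hand-derived gradient of the MSE loss w.r.t. each parameter
    grad_slope = sum(2 * (f(x, W) - y) * x for x, y in D) / len(D)
    grad_intercept = sum(2 * (f(x, W) - y) for x, y in D) / len(D)

    # Gradient descent step: W <- W - eta * grad
    W = [W[0] - eta * grad_slope, W[1] - eta * grad_intercept]

print(W)  # should end up close to the made-up "true" parameters (3, 2)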

What I explained previously falls within the MLE paradigm in probabilistic modelling.

Before understanding the differences between MLE and MAP, it is important to keep in mind Bayes' theorem:

$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$$

Naming goes:

  • $p(Y \mid X)$: posterior
  • $p(X \mid Y)$: likelihood
  • $p(Y)$: prior
  • $p(X)$: evidence
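In our parameter-fitting setting, $Y$ plays the role of the parameters $\theta$ and $X$ plays the role of the dataset $D$, so the same identity reads:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$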

Differences:

  • MLE (Maximum Likelihood Estimation) finds the parameters which best explain the data, i.e. the ones that give the highest likelihood:

$$\theta^\star = \arg\max_\theta L(D, \theta)$$

  • MAP (Maximum a Posteriori) works within the Bayesian framework and, instead of maximizing the likelihood, maximizes the posterior: it finds the most probable parameters given the data. For this, we need to assume some prior distribution over the parameters $p(\theta)$.

$$\theta^\star = \arg\max_\theta p(\theta \mid D) = \arg\max_\theta \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \arg\max_\theta p(D \mid \theta)\, p(\theta)$$

Notice that the probability of the observed data $p(D)$ is independent of the model parameters. Thus, we do not need to consider it in the MAP computation.

In addition, this can be linked to some regularization techniques, depending on the prior chosen (see the sketch below).
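As a sketch of that link: assuming, for instance, a zero-mean Gaussian prior $p(\theta) \propto \exp\left(-\lambda \|\theta\|^2\right)$ (this particular choice is my illustration, not something fixed by the setup above), taking logarithms of the MAP objective gives:

$$\theta^\star = \arg\max_\theta \left[ \log p(D \mid \theta) + \log p(\theta) \right] = \arg\max_\theta \left[ \log p(D \mid \theta) - \lambda \|\theta\|^2 \right]$$

That is, maximizing the likelihood with an L2 penalty on the parameters, which is exactly weight decay / ridge regularization; other priors (e.g. a Laplace prior) lead to other penalties (e.g. L1).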

PyTorch implementation

The previous logic is abstracted by deep learning libraries like so:

for inputs, labels in training_loader:
    # Zero your gradients for every batch
    optimizer.zero_grad()

    # Make predictions for this batch
    outputs = model(inputs)

    # Compute the loss
    loss = loss_fn(outputs, labels)

    # Compute each parameter's gradients wrt the loss
    loss.backward()

    # Update the parameters using the computed gradients
    optimizer.step()
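For completeness, here is a minimal (hypothetical) setup that would make the loop above runnable; the names model, loss_fn, optimizer and training_loader are the ones the loop assumes, while the architecture and hyperparameters are just illustrative placeholders:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data and model, just to give the training loop something to run on
xs = torch.randn(100, 1)
ys = 3.0 * xs + 2.0 + 0.1 * torch.randn(100, 1)

training_loader = DataLoader(TensorDataset(xs, ys), batch_size=16, shuffle=True)
model = nn.Linear(1, 1)                                   # f(x; W): a single linear layer
loss_fn = nn.MSELoss()                                    # regression -> MSE
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # eta = 0.1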