This is a very shallow post I use as a reference from other posts, just to make sure we are on the same page regarding notation and concepts around Supervised Learning.
How are models “trained”?
Imagine we are given a dataset of input-output pairs1:
1 Here \(x\) and \(y\) can be anything: image \(\leftrightarrow\) tag, audio \(\leftrightarrow\) transcription, tokens \(\leftrightarrow\) next_token, …
\[ \mathcal{D} = \{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\} \]
We are asked (quite unsurprisingly) to find a decent mapping \(x \rightarrow y\). We do so by first assuming a functional form2 \(f\) of the mapping dependent on some parameters \(W\)3:
2 Fancy way of saying “a sequence of operations”
3 The more the merrier 🤪
\[ y = f(x; W) \]
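As a tiny sketch (a hypothetical linear model in plain NumPy, not tied to any particular task), such a functional form could look like:

```python
import numpy as np

def f(x, W, b):
    """A simple linear functional form: y = W x + b."""
    return W @ x + b

# Hypothetical parameters for a 1-D input and output
W = np.array([[2.0]])
b = np.array([1.0])

y = f(np.array([3.0]), W, b)  # 2 * 3 + 1 = 7
```

Here `W` and `b` play the role of the parameters we will later adjust.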
To find the best parameters according to the data, we define an error metric (loss function) which we want to optimize. If working on regression, it could be MSE (mean squared error); if working on classification, it could be cross-entropy, or any other suitable metric for that matter:
\[ \mathcal{L} (f, \mathcal{D}, W) \]
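For instance, a minimal MSE implementation in plain NumPy (illustrative, not this post's code):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.5, 2.0])
loss = mse(y_true, y_pred)  # (0 + 0.25 + 1) / 3
```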
Now we need to find a global optimum of \(\mathcal{L}\) with respect to \(W\). This translates into finding a zero of its gradient. Usually however4, the analytical expression of the gradient is too complex to derive. Luckily, we can approximate it numerically at some points (provided some input-output pairs).
4 Unless very carefully choosing \(f\), which often makes its modelling power very limited.
We usually start with “random” parameters \(W\) and we do some variation of gradient descent to find the ones which minimize the loss. The idea is that we iteratively take steps towards a local optimum until convergence:
\[ W^{k+1} \leftarrow W^{k} - \eta \cdot \nabla_W \mathcal{L} \left( f, \mathcal{D}, W^k \right) \]
Where \(\eta\) is the learning rate, which tells us how big of a step we take.
The gradient \(\nabla_x f(x) = \left( \frac{\partial f}{\partial x_1}, ..., \frac{\partial f}{\partial x_n} \right)\) is the generalization to multi-dimensional inputs of the derivative \(\frac{\partial f(x)}{\partial x}\). The direction of the vector indicates the direction of maximum ascent of \(f\) from a given point.
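The update rule above can be sketched end-to-end on a toy one-parameter regression (my example, with the gradient derived by hand rather than approximated by autodiff):

```python
import numpy as np

# Toy data generated by y = 3x, so the optimal parameter is w = 3
xs = np.array([1.0, 2.0, 3.0])
ys = 3.0 * xs

w = 0.0      # "random" initial parameter
eta = 0.05   # learning rate

for _ in range(200):
    preds = w * xs
    # Gradient of MSE wrt w: 2 * mean((w*x - y) * x)
    grad = 2 * np.mean((preds - ys) * xs)
    # Gradient descent step
    w = w - eta * grad

# w converges towards 3
```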
What I explained previously falls within the MLE paradigm in probabilistic modelling.
Before understanding the differences it is important to keep in mind the Bayes theorem:
\[ p \left(Y \mid X \right) = \frac{p \left(X \mid Y \right) \space p \left(Y \right)}{p \left(X \right)} \]
Naming goes:
- \(p \left(Y \mid X \right)\) posterior
- \(p \left(X \mid Y \right)\) likelihood
- \(p \left(Y \right)\) prior
- \(p \left(X \right)\) evidence
Differences:
- MLE (Maximum Likelihood Estimation) finds the parameters which best explain the data, i.e. the ones that give the highest likelihood:
\[ \theta^\star = \arg \max_\theta \mathcal{L} \left( \mathcal{D}, \theta \right) \]
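As a classic worked example (mine, not the post's): estimating the heads probability of a coin from observed flips by maximizing the Bernoulli log-likelihood over a grid recovers the empirical frequency \(k/n\):

```python
import numpy as np

# 7 heads out of 10 flips; the MLE of the heads probability is k/n = 0.7
flips = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

def log_likelihood(theta, data):
    """Bernoulli log-likelihood of the observed flips under parameter theta."""
    return np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

# Grid search over candidate parameters
thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmax([log_likelihood(t, flips) for t in thetas])]
# best is (approximately) 0.7
```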
- MAP (Maximum a Posteriori) works within the Bayesian framework and, instead of maximizing the likelihood, maximizes the posterior. It finds the most probable parameters given the data. For this, we need to assume some prior distribution over the parameters \(p(\theta)\).
\[ \theta^* = \arg\max_{\theta} p(\theta \mid D) = \arg\max_{\theta} \frac{p(D \mid \theta)p(\theta)}{p(D)} = \arg\max_{\theta} p(D \mid \theta)p(\theta) \]
Notice that the probability of the observed data is independent of the model parameters. Thus, we do not need to consider it for the MAP computation.
In addition, this can be linked to some regularization techniques, depending on the prior chosen.
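For instance, assuming a zero-mean Gaussian prior \(p(\theta) \propto \exp\left(-\lambda \lVert \theta \rVert^2\right)\), taking logarithms shows that MAP estimation amounts to MLE plus an L2 penalty (i.e. weight decay):

\[ \theta^\star = \arg\max_\theta \left[ \log p(\mathcal{D} \mid \theta) + \log p(\theta) \right] = \arg\max_\theta \left[ \log p(\mathcal{D} \mid \theta) - \lambda \lVert \theta \rVert^2 \right] \]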
PyTorch implementation
The previous logic is abstracted by deep learning libraries like so:
for inputs, labels in training_loader:
    # Zero the gradients for every batch
    optimizer.zero_grad()
    # Make predictions for this batch
    outputs = model(inputs)
    # Compute the loss
    loss = loss_fn(outputs, labels)
    # Compute each parameter's gradient wrt the loss
    loss.backward()
    # Update the parameters using the gradients
    optimizer.step()