Taxonomy of ML loss functions

Today I review the most well-known loss functions floating around. I classify them by their task type.
ML Basics
Author

Oleguer Canal

Published

December 2, 2025

We are going to talk about losses for: Regression, Classification, and Distance (representation) learning.

Coming soon I’ll add losses for: Ranking, Sequence, Generative, Structured, and Reinforcement learning.

Regression

For when we wanna predict a continuous numeric value for each input. We are given a dataset \(\mathcal{D} = \{ (x_i, y_i)\}_{i=1:N}\), where \(y_i \in \mathbb{R}\) is the true value to guess and \(\hat{y}_i \in \mathbb{R}\) the predicted value.

L1 (MAE)

\[ L(y, \hat{y}) = | y - \hat{y} | \]

This is as simple as it gets, however it has a couple of problems:

  • The gradient is always either \(1\) or \(-1\); ideally we would want smaller gradients close to the optimum so that optimization steps aren’t as extreme.
  • It is non-differentiable at the optimum \(y = \hat{y}\).
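A tiny numpy sketch (made-up targets and guesses) showing the constant-magnitude gradient:

```python
import numpy as np

# Made-up targets and predictions
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 1.0, 3.0])

loss = np.abs(y - y_hat)   # L1 loss per sample
grad = np.sign(y_hat - y)  # dL/d y_hat: always -1, 0 or +1

print(loss)  # per-sample absolute errors
print(grad)  # gradient is +-1 regardless of how bad the guess is (0 at the optimum)
```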

L2 (MSE)

\[ L(y, \hat{y}) = | y - \hat{y} |^2 \]

Things to note:

  • This solves the previous problems of adaptability1 and differentiability.
  • It introduces a new problem however: very bad guesses now have a huge gradient, which can destabilize training.
  • Both L1 and L2 losses are special cases of the more generic \(L_p\) loss: p-norm of the error vector.

1 Now the gradient is linearly proportional to how bad the guess is.
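The footnote’s claim is easy to check numerically; with made-up values, the L2 gradient scales linearly with the error:

```python
import numpy as np

# Made-up targets and predictions
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 0.0, 3.0])

# d/d y_hat |y - y_hat|^2 = 2 (y_hat - y): proportional to how bad the guess is
grad = 2 * (y_hat - y)
print(grad)  # the worse the guess, the bigger the step
```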

Huber

Gets the best of both L1 and L2 by combining them depending on how far-off the guess is2:

2 Usually \(\delta \approx 1\).

\[ L(y, \hat{y}; \delta) = \begin{cases} \frac{1}{2} | y - \hat{y} |^2 & \text{if } | y - \hat{y} | \leq \delta \\ \delta \left( | y - \hat{y} | - \frac{\delta}{2} \right) & \text{else} \end{cases} \]
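A minimal numpy sketch, assuming the standard smooth Huber form (a \(\frac{1}{2}\) factor and a \(\delta\)-scaled linear branch, so the two pieces match in value and slope at the cutoff):

```python
import numpy as np

def huber(y, y_hat, delta=1.0):
    """Quadratic near zero error, linear far away; pieces match at |err| = delta."""
    err = np.abs(y - y_hat)
    quadratic = 0.5 * err**2
    linear = delta * (err - 0.5 * delta)
    return np.where(err <= delta, quadratic, linear)

errors = np.array([0.5, 1.0, 3.0])  # made-up error magnitudes
print(huber(0.0, errors))           # small errors get squared, big errors stay linear
```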

Classification

For when we wanna assign each input to one of a fixed set of categories or labels. Here I’ll mostly go over pytorch-implemented losses, to have some kind of guidance, and explain those. As we’ll see, most of them are equivalent to one another and some allow for extra gimmicks.

There are two (equivalent) ways of looking into this:

  • Through Negative Log-Likelihood (NLL) minimization lenses
  • Through Kullback-Leibler divergence (KL) minimization lenses

Negative Log-Likelihood (NLL)

This is closer to the frequentist statistics / ML point of view. Working within the MLE paradigm, we aim to find the parameters that best explain the data:

\[ \max_\theta p(\mathcal{D} \mid \theta) \]

If the dataset we are given is \(\mathcal{D} = \{ (x_i, y_i) \}_{i=1:N}\) (we assume i.i.d. samples) and our model is \(q_\theta\), we have that the likelihood of this data is given by:

\[ L(\theta) = \prod_{i=1:N} q_\theta (y_i \mid x_i ) \]

We want to maximize this \(L(\theta)\) (since we are doing MLE haha). Taking logarithms for the usual numerical reasons and minimizing the negative (since most optimization packages work on function minimization):

\[ \min_\theta - \sum_{i=1:N} \log q_\theta (y_i \mid x_i ) \]

This is known as Negative Log-Likelihood (NLL).

How do we get something like BCE from this?

We can see BCE as a special case of NLL where the model operates as a \(\text{Bernoulli}\) distribution3.

In binary classification \(y \in \{0, 1\}\). Our model tells us:

\[ q_\theta(y = 1 \mid x) \]

Conversely, \(q_\theta(y = 0 \mid x) = 1 - q_\theta(y = 1 \mid x)\). We can see this as a \(\text{Bernoulli}\) distribution of parameter \(\lambda = q_\theta(y = 1 \mid x)\). Thus, if we wanna write \(q_\theta (y \mid x)\) as a single expression:

\[ q_\theta (y \mid x) = \text{Bernoulli} (y; \lambda) = \lambda^y (1 - \lambda)^{1 - y} \]

If we take the negative log of a single datapoint of the dataset:

\[ - \log q_\theta (y \mid x) = - \left[\underbrace{y \cdot \log (q_\theta(y = 1 \mid x))}_{\text{activates when } y=1} + \underbrace{(1 - y) \cdot \log(1 - q_\theta(y = 1 \mid x))}_{\text{activates when } y=0} \right] \]
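This expression is easy to verify numerically; a small sketch with a made-up model output \(\lambda = 0.8\):

```python
import numpy as np

def bernoulli_nll(y, lam):
    """-log Bernoulli(y; lam): exactly the bracketed BCE expression, one datapoint."""
    return -(y * np.log(lam) + (1 - y) * np.log(1 - lam))

lam = 0.8  # made-up model output q(y=1 | x)
print(bernoulli_nll(1, lam))  # -log(0.8): small, the guess was confident and right
print(bernoulli_nll(0, lam))  # -log(0.2): large, the guess was confident and wrong
```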

Kullback Leibler Divergence (KL)

This is closer to an information theory point of view. We are given samples of a distribution \(p\) and we try to model it with a parametrized distribution \(q_\theta\). For this, we try to minimize a distance between both distributions (\(p\) and \(q\)). Kullback Leibler Divergence is a common statistical distance between distributions.

\[ \begin{split} D_{KL} \left( p \mid \mid q_\theta \right) &= E_P(I_{q_\theta} - I_p)\\ &= E_P(I_{q_\theta}) - E_P(I_p)\\ &= \mathcal{H} (p, q_\theta) - \underbrace{\mathcal{H} (p)}_{\text{constant}} \end{split} \]

Where since \(p\) is given and doesn’t depend on \(\theta\) we can ignore its entropy \(\mathcal{H} (p)\). Remember that the cross-entropy between two distributions is defined as:

\[ \mathcal{H} (P, Q) = E_P (I_Q) = - \sum_x p(P=x) \cdot \log(p(Q = x)) \]

Where we used that:

  • The expectation of a function \(f\) of event \(X\) under distribution \(P\) is: \(E_P (f(X)) = \sum_{x} f(x) \cdot p(P=x)\)
  • Information content of event \(X\) under distribution \(Q\) is: \(I_Q (X) = - \log(p(Q=x))\)

Thus, we have that:

\[ \begin{split} \arg \min_\theta D_{KL} \left( p \mid \mid q_\theta \right) &= \arg \min_\theta \mathcal{H} (p, q_\theta)\\ &= \arg \min_\theta - \sum_x p(P=x) \cdot \log(p(Q = x)) \end{split} \]

Re-writing it in more ML-friendly terms: let \(p(y \mid x)\) be the true distribution of the labels, and \(q_\theta (y \mid x) \equiv p(Q = y \mid x)\) our model predictions. We get that

\[ \arg \min_\theta D_{KL} \left( p \mid \mid q_\theta \right) = \arg \min_\theta - \sum_{k \in \text{classes}} p(y =k \mid x) \cdot \log(q_\theta (y = k \mid x)) \]

Notice this is equivalent to what we derived before if \(p(y)\) can only take two values: \(y \in \{0, 1\}\), since we usually do one-hot encoding of the correct class4.

Final note: The theoretical definition of KL Divergence uses an expectation \(E_p\) over the true data distribution, which is unknown in practice. To solve this, we approximate the expectation using the empirical average over our observed dataset samples (Monte Carlo approximation). This explains why minimizing the finite sum in NLL is statistically equivalent to minimizing the theoretical expectation in KL.
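A quick numerical check of this NLL/cross-entropy equivalence, with made-up probabilities: when the target is one-hot, the cross-entropy sum collapses to the NLL of the correct class.

```python
import numpy as np

q = np.array([0.1, 0.7, 0.2])  # made-up model probabilities q_theta(y=k | x)
p = np.array([0.0, 1.0, 0.0])  # one-hot "true" distribution: correct class is k=1

cross_entropy = -np.sum(p * np.log(q))  # the KL-derived objective (H(p) = 0 here)
nll = -np.log(q[1])                     # NLL of the correct class

print(cross_entropy, nll)  # identical values
```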

4 Note we are not limited by this and we can easily generalize to multi-class classification.

3 Categorical cross-entropy is analogous but operating on a \(\text{Categorical}\) distribution instead of a \(\text{Bernoulli}\) one.

NLL

Let \(\hat{y}\) be the log-softmaxed output logits of our model and \(y\) the correct class5. NLL minimizes the negative6 guessed value of the correct coordinate:

5 Important: Here I describe NLL as implemented by pytorch for categorical data where each output is a single ID: \(y \in \{1, \dots, C\}\) (no soft labels, no multi-label, no regression). As discussed before in Tip 1, the NLL paradigm is much more general than this.

6 Aka “maximizes the”

\[ \begin{split} L(y, \hat{y}) &= - \hat{y}_y\\ &\equiv - \sum_{k=1:C} \mathbb{1}_{y=k} \cdot \hat{y}_k\\ \end{split} \]

Notes:

  • Remember \(\text{logsoftmax(x)}_i = \log \left( \frac{e^{x_i}}{\sum_j e^{x_j}} \right)\).7
  • Since we ran the \(\text{softmax}\) function, this not only encourages the correct-class value to grow, it also forces the incorrect-class guesses to decrease (probabilities must sum to 1).
  • Most implementations add a weight factor to compensate for class imbalance: \(L(y, \hat{y}) = - w_y \hat{y}_y\)

Cross-Entropy

Let \(\hat{z}\) be the unnormalized output logits of our model; the pytorch implementation computes: \[ \begin{split} L(y, \hat{z}) &= - \text{logsoftmax}_y (\hat{z}) \\ &\equiv - \sum_{k=1:C} \mathbb{1}_{y=k} \cdot \log ( \text{softmax}_k (\hat{z}) ) \end{split} \]

Notes:

  • As we saw on Tip 1, running Cross-Entropy Loss is equivalent to applying the \(\text{logsoftmax(z)}\) on your logits and then running NLL Loss.
  • If we are working with binary classification, we usually use the equivalent Binary Cross-Entropy Loss (BCELoss). Given our guess \(\sigma(\hat{z})\) (so it is re-scaled into the \([0, 1]\) range) we have that \(L(y, \sigma(\hat{z})) = - \left[ y \log(\sigma(\hat{z})) + (1 - y) \log ( 1 - \sigma(\hat{z})) \right]\). Notice that this allows for soft labels.
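As a sanity check (made-up logit), BCE on \(\sigma(\hat{z})\) matches the two-class cross-entropy on the logit pair \([\hat{z}, 0]\), since \(\sigma(z) = \text{softmax}([z, 0])_0\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_softmax(z):
    z = z - z.max()  # numerical stability
    return z - np.log(np.exp(z).sum())

z_hat, y = 1.3, 1  # made-up logit and binary label

bce = -(y * np.log(sigmoid(z_hat)) + (1 - y) * np.log(1 - sigmoid(z_hat)))
ce = -log_softmax(np.array([z_hat, 0.0]))[0]  # slot 0 plays the role of y=1

print(bce, ce)  # same value
```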

KL Divergence

Computes the point-wise KL-divergence. Let \(\hat{y} \in \mathbb{R}_{[0, 1]}^{\text{BATCH} \times C}\) be the guessed probabilities and \(y \in \mathbb{R}_{[0, 1]}^{\text{BATCH} \times C}\) the distribution we are trying to match; then: \[ L(y, \hat{y}) = y \cdot (\log (y) - \log (\hat{y})) \]

Notes:

  • Remember \(D_{KL} \left( p \mid \mid q \right) = E_p(I_q - I_p) \approx \frac{1}{N} \sum_i y_i \cdot ( \log(y_i) - \log(\hat{y}_i) )\). In the actual pytorch implementation the roles of input and target are easy to mix up, and the input is expected in log-space while the target is not; read the docs carefully when using it.
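A numpy sketch of the point-wise term with made-up distributions (in pytorch, remember, the prediction would be passed in log-space):

```python
import numpy as np

y = np.array([0.1, 0.6, 0.3])      # made-up target distribution p
y_hat = np.array([0.2, 0.5, 0.3])  # made-up predicted distribution q

# Point-wise terms p * (log p - log q); their sum is D_KL(p || q)
kl_terms = y * (np.log(y) - np.log(y_hat))
print(kl_terms.sum())  # ~0.04: non-negative, and 0 only when y == y_hat
```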

Focal

This is a modification of the cross-entropy loss designed to handle class imbalance and hard-vs-easy examples. The core idea is to down-weight easy examples so that they don’t dominate the learning, helping the model “focus” on the hard ones. Let \(\hat{y}\) be the normalized (softmaxed) probabilities output by the model:

\[ L(y, \hat{y}) = - \sum_{c=1:C} \alpha_c ( 1 - \hat{y}_c)^\gamma \, \mathbb{1}_{y=c} \cdot \log (\hat{y}_c ) \]

Where:

  • \(\alpha_c\) is the class weight. Same as before, this is used to compensate for unbalanced datasets.
  • The focussing term8 \(( 1 - \hat{y}_c )^\gamma \xrightarrow[\hat{y}_c \rightarrow 1]{} 0\) makes the loss smaller when the model is very confident, so easy examples contribute less to the learning. The opposite happens for under-confident or wrong guesses.
  • \(\gamma > 0\) is the “focussing parameter” (usually \(\approx 2\)).

8 I would call it de-focussing term but whatever.

Here we can clearly see the effect of the de-focussing term for highly-confident guesses.

Code
import plotly.graph_objects as go
import numpy as np

alpha = 1
gamma = 2

# We plot the loss for a single class (the correct one)
# as the predicted probability (y_hat) goes from 0.05 to 1
y_hat = np.linspace(0.05, 1, 100)

# Formula: L = - alpha * (1 - y_hat)^gamma * log(y_hat)
loss_focal = - alpha * np.power(1 - y_hat, gamma) * np.log(y_hat)
loss_ce = - alpha * np.log(y_hat) # equivalent to gamma=0

fig = go.Figure()
fig.add_trace(go.Scatter(x=y_hat, y=loss_focal, mode='lines', name=f'Focal Loss (gamma={gamma})'))
fig.add_trace(go.Scatter(x=y_hat, y=loss_ce, mode='lines', name='Cross Entropy (gamma=0)'))

fig.update_layout(
    title=f"Focal Loss (alpha={alpha})",
    xaxis_title='Predicted Probability (y_hat)',
    yaxis_title='Loss',
    margin=dict(l=20, r=20, t=30, b=20),
    legend=dict(
        yanchor="top",
        y=0.99,
        xanchor="right",
        x=0.99
    )
)
fig.show()

Hinge

The Hinge loss introduces the idea of not only optimizing for “correctness” but for “confident correctness”. It does so by enforcing a margin between classes. In its simplest form (SVMs, Tip 4) it operates on binary classification where \(y \in \{ -1, 1\}\) with a margin of 1 unit:

\[ L(y, \hat{y}) = \max(0, 1 - y \cdot \hat{y}) \]

It is called “hinge loss” because it looks like a hinge.

Code
import plotly.graph_objects as go
import numpy as np

x = np.linspace(-2, 2, 400)
loss_y1 = np.maximum(0, 1 - x)
loss_ym1 = np.maximum(0, 1 + x)

fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=loss_y1, mode='lines', name='y=1', line=dict(color='green')))
fig.add_trace(go.Scatter(x=x, y=loss_ym1, mode='lines', name='y=-1', line=dict(color='blue')))

fig.update_layout(
    title="Hinge Loss",
    xaxis_title='Guess (y_hat)',
    yaxis_title='Loss',
    margin=dict(l=20, r=20, t=30, b=20),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    )
)
fig.show()
  • If \(y = 1\):
    • If the guess is 1 or over, the loss is 0.
    • If the guess is under 1, the loss increases linearly with how far off we are.

(Analogous for when the label is \(-1\).)

This idea, as we’ll see, is used by many subsequent loss functions9.

9 It was primarily used in SVMs

SVMs are a simple model to find the hyperplane that best splits a 2-class dataset. It is framed as a convex optimization problem with the objective of finding the largest margin between the two classes of points.

\[ \vec{w}^T \cdot \vec{x} \rightarrow \begin{cases} \text{class} \space \space 1 \qquad & \text{if} \geq 1\\ \text{class} \space -1 \qquad & \text{if} \leq -1 \end{cases} \]

In ANN terms, it can be seen as: no-bias linear layer + Hinge Loss.

NOTE: Since it is quite unlikely that the data is linearly separable, we usually use a kernel to project it to a higher-dim space. Often making use of the kernel trick for computational reasons.

Distance

For when we wanna learn a representation10 where similar items are close and dissimilar items are far apart. Notice that the term “contrastive loss” is overloaded in the literature and may refer to different things. I’ll try to highlight the most relevant ideas, but keep in mind that naming might differ in other places.

10 Aka embedding.

Pairwise contrastive

Given a pair of training examples \((x_i, x_j)\) and the label \(y = \begin{cases} 1 \space \text{if similar}\\ 0 \space \text{else} \end{cases}\), and a minimum margin \(m\).

Let \(d = \text{distance}(f_\theta(x_i), f_\theta(x_j))\); we have that:

\[ L(y, d) = y \cdot \underbrace{d^2}_{\text{if similar}} + (1-y) \cdot \underbrace{\max(0, m - d)^2}_{\text{if different}} \]

  • If the elements should be similar \((y=1)\), the loss grows quadratically with distance.
  • If the elements should be dissimilar \((y=0)\):
    • The loss is 0 beyond the given margin11.
    • Below the margin, the loss grows quadratically as the distance shrinks.

11 Not to push them infinitely apart.

Here we can see how the loss behaves for similar and dissimilar pairs.

Code
import plotly.graph_objects as go
import numpy as np

m = 2
d = np.linspace(0, 3, 300)

# Similar case (y=1): L = d^2
loss_similar = d**2

# Dissimilar case (y=0): L = max(0, m - d)^2
loss_dissimilar = np.maximum(0, m - d)**2

fig = go.Figure()
fig.add_trace(go.Scatter(x=d, y=loss_similar, mode='lines', name='Similar (y=1)', line=dict(color='green')))
fig.add_trace(go.Scatter(x=d, y=loss_dissimilar, mode='lines', name='Dissimilar (y=0)', line=dict(color='red')))

fig.update_layout(
    title=f"Pairwise Contrastive Loss (m={m})",
    xaxis_title='Distance (d)',
    yaxis_title='Loss',
    margin=dict(l=20, r=20, t=30, b=20),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    )
)
fig.show()
  • Green line (similar): The loss increases quadratically with distance, encouraging similar items to be close together.
  • Red line (dissimilar): The loss is 0 when distance ≥ margin (m=2), but increases quadratically as distance decreases below the margin, encouraging dissimilar items to be at least m units apart.

Triplet contrastive

Same idea as before, but now we work on triplets of elements:

  • Anchor: \(\vec{a} = f_\theta (x)\): Where \(x\) is an example sampled from the training set.
  • Positive: \(\vec{p} = f_\theta (x_p)\): Where \(x_p\) is an example with same class as anchor.
  • Negative: \(\vec{n} = f_\theta (x_n)\): Where \(x_n\) is an example with different class as anchor.

Let \(\alpha\) be some margin and \(d\) some distance function:

\[ L(\vec{a}, \vec{p}, \vec{n}) = \max \left(0, d(\vec{a}, \vec{p}) - d(\vec{a}, \vec{n}) + \alpha \right) \]

We want the difference between distances to be bigger than \(\alpha\). If this condition is met, the loss is zero so that embeddings don’t continue diverging.

\[ d(\vec{a}, \vec{n}) - d(\vec{a}, \vec{p}) \geq \alpha \]
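A minimal numpy sketch with made-up 2-D embeddings and euclidean distance:

```python
import numpy as np

def triplet_loss(a, p, n, alpha=0.2):
    """max(0, d(a,p) - d(a,n) + alpha) with euclidean distance."""
    d_ap = np.linalg.norm(a - p)
    d_an = np.linalg.norm(a - n)
    return max(0.0, d_ap - d_an + alpha)

# Made-up 2-D embeddings
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])  # close to the anchor
n = np.array([1.0, 0.0])  # far from the anchor

print(triplet_loss(a, p, n))                     # 0.0: margin already satisfied
print(triplet_loss(a, p, np.array([0.2, 0.0])))  # ~0.1: the negative is too close
```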

Angular

Same as triplet contrastive but working in angular space (even the margin). Let:

  • \(\theta_{a,p}\) be the angle between the anchor and positive example.
  • \(\theta_{a,n}\) be the angle between the anchor and negative example.

\[ L(\theta_{a,p}, \theta_{a,n}) = \max\left(0,\ \cos(\theta_{a,n}) - \cos(\theta_{a,p} + m)\right) \]

N-pair

Generalizes triplet loss to one positive and many negatives. Encourages correct pair to have higher similarity than all negatives. Let:

  • Anchor: \(\vec{a} = f_\theta (x)\): Where \(x\) is an example sampled from the training set.
  • Positive: \(\vec{p} = f_\theta (x_p)\): Where \(x_p\) is an example with same class as anchor.
  • Negatives: \(\vec{n_i} = f_\theta ({x_n}_i)\): Where \({x_n}_i\) are examples of a different class than the anchor.

\[ L(a, p, \{ n_i \}_i) = \log \left( 1 + \sum_i e^{\vec{a}^T \cdot \vec{n_i} - \vec{a}^T \cdot \vec{p}} \right) \]

To minimize this function, you want \(\vec{a}^T \cdot \vec{n_i} < \vec{a}^T \cdot \vec{p} \quad \forall i\), i.e. the (dot-product) similarity between the anchor and the positive to be higher than between the anchor and any negative.
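A small numpy sketch of the single-anchor loss, with made-up 2-D embeddings:

```python
import numpy as np

def n_pair_loss(a, p, negs):
    """log(1 + sum_i exp(a.n_i - a.p)) for a single anchor."""
    scores = np.array([a @ n - a @ p for n in negs])
    return np.log1p(np.exp(scores).sum())

# Made-up 2-D embeddings
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])                              # similar direction: high a.p
negs = [np.array([-1.0, 0.0]), np.array([0.0, 1.0])]  # dissimilar to the anchor

print(n_pair_loss(a, p, negs))  # small: every a.n_i is below a.p
```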

InfoNCE

InfoNCE (aka Information Noise-Contrastive Estimation) is similar to N-Pair: Maximize similarity of the positive pair among many negatives.

Let:

  • \(\text{sim}\) be some similarity function12.
  • \(\tau\) some temperature parameter.

12 For instance cosine.

\[ L(a, p, \{ n_i \}_i) = - \log \frac{e^{\frac{\text{sim}(a, p)}{\tau}}}{ e^{\frac{\text{sim} (a, p)}{\tau}} + \sum_i e^{\frac{\text{sim}(a, n_i)}{\tau}}} \]

This can be seen as performing NLL on a classification result where the positive example is the correct class and the rest are the negatives. Essentially, we are doing \(\text{softmax}\) of the similarities.
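A numpy sketch with cosine similarity and made-up embeddings; note how this is just \(-\log\) of a softmax over similarities, with the positive in the correct-class slot:

```python
import numpy as np

def info_nce(a, p, negs, tau=0.1):
    """-log softmax of sim(a, p) among all candidates (cosine similarity)."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    sims = np.array([cos(a, p)] + [cos(a, n) for n in negs]) / tau
    sims = sims - sims.max()  # numerical stability
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())

# Made-up 2-D embeddings
a = np.array([1.0, 0.0])
p = np.array([0.8, 0.2])
negs = [np.array([0.0, 1.0]), np.array([-1.0, 0.1])]

print(info_nce(a, p, negs))  # low loss: the positive is the most similar candidate
```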

Proxy-Based

Every class has a learned proxy vector \(\vec{p}_c\) associated with it (a class embedding). We basically apply NLL to the softmaxed distances between our guess and all these class representatives:

\[ L(y, \vec{\hat{y}}) = - \log \frac{e^{- \|\vec{\hat{y}} - \vec{p}_y \|^2}}{\sum_c e^{- \|\vec{\hat{y}} - \vec{p}_c \|^2}} \]
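A numpy sketch with made-up (non-learned) proxies; in practice each \(\vec{p}_c\) would be trained along with the encoder:

```python
import numpy as np

def proxy_loss(y_hat, proxies, y):
    """NLL of the softmax over negative squared distances to each class proxy."""
    neg_d2 = np.array([-np.sum((y_hat - p) ** 2) for p in proxies])
    neg_d2 = neg_d2 - neg_d2.max()  # numerical stability
    return -np.log(np.exp(neg_d2[y]) / np.exp(neg_d2).sum())

proxies = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # made-up class proxies
y_hat = np.array([0.9, 0.1])                            # embedding near class 0

print(proxy_loss(y_hat, proxies, 0))  # low: the guess sits close to its proxy
```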