2 \(\lambda\) controls the strength of the regularization: high \(\lambda \implies\) high bias and low variance (and vice versa for low \(\lambda\)).
LASSO adds an \(L_1\) norm penalty to the loss function3:
3 Remember \(\left\| W \right\|_1 = \sum_{i=1}^n |w_i|\)
\[
\mathcal{L}_{\text{LASSO}} = \mathcal{L} + \lambda \left\| W \right\|_1
\]
Ridge adds an \(L_2\) norm penalty to the loss function4:
4 Remember \(\left\| W \right\|_2^2 = \sum_{i=1}^n w_i^2\)
\[
\mathcal{L}_{\text{Ridge}} = \mathcal{L} + \lambda \left\| W \right\|_2^2
\]
Where \(\lambda\) is the regularization parameter.
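As a quick illustrative sketch (the helper names `l1_penalty` and `l2_penalty` are my own, not from any library), both penalties can be computed directly from the weight vector:

```python
def l1_penalty(weights, lam):
    """LASSO penalty: lambda times the sum of absolute values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge penalty: lambda times the sum of squared values."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -2.0, 1.5]
print(l1_penalty(weights, 0.5))  # 0.5 * (0.5 + 2.0 + 1.5) = 2.0
print(l2_penalty(weights, 0.5))  # 0.5 * (0.25 + 4.0 + 2.25) = 3.25
```

During training, the chosen penalty is simply added to the data loss before computing gradients.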
MLE vs MAP
It is instructive to link these regularization techniques to the frequentist vs Bayesian perspectives on model optimization.
We usually think of model optimization from a frequentist perspective: We apply Maximum Likelihood Estimation (MLE) to find the parameters that best describe the data:
\[
\max_\theta p (\mathcal{D} \mid \theta)
\]
However, we could also take a Bayesian perspective and apply Maximum a Posteriori (MAP) estimation, in which case we optimize the posterior5:
5 I remove the denominator (aka evidence) as it is independent of the parameters.
\[
\max_\theta p (\theta \mid \mathcal{D}) \propto \max_\theta p (\mathcal{D} \mid \theta) \cdot p (\theta)
\]
Notice that now we have to assume some prior distribution over the parameters \(p(\theta)\). Interestingly, this prior distribution can be seen as a regularization term in the loss function.
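To make this concrete, take the negative log of the posterior (which preserves the location of the optimum): the likelihood term becomes the usual loss, and the prior term becomes the regularizer:

\[
-\log p(\theta \mid \mathcal{D}) = \underbrace{-\log p(\mathcal{D} \mid \theta)}_{\mathcal{L}} \; \underbrace{- \log p(\theta)}_{\text{regularizer}} + \text{const}
\]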
For instance, if we assume a Gaussian prior distribution over the parameters \(p(\theta) = \mathcal{N}(\theta \mid 0, \sigma^2)\), we get the following loss function: \[
\mathcal{L}_{\text{MAP}} = \mathcal{L} + \lambda \left\| W \right\|_2^2 + \text{const}
\]
Where \(\lambda = \frac{1}{2\sigma^2}\) and \(\text{const}\) is a constant term independent of the parameters.
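To see where \(\lambda = \frac{1}{2\sigma^2}\) comes from, take the negative log of the Gaussian prior (assuming independent parameters):

\[
-\log p(\theta) = -\sum_{i=1}^n \log \mathcal{N}(w_i \mid 0, \sigma^2) = \frac{1}{2\sigma^2} \sum_{i=1}^n w_i^2 + \text{const} = \frac{1}{2\sigma^2} \left\| W \right\|_2^2 + \text{const}
\]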
This means that we can see the MLE optimization + Ridge regularization as a MAP optimization with a Gaussian prior distribution.
Similarly, if we assume a Laplace prior distribution over the parameters \(p(\theta) = \text{Laplace}(\theta \mid 0, b)\) with scale \(b\), we get the following loss function: \[
\mathcal{L}_{\text{MAP}} = \mathcal{L} + \lambda \left\| W \right\|_1 + \text{const}
\]
Where \(\lambda = \frac{1}{b}\) and \(\text{const}\) is a constant term independent of the parameters.
This means that we can see the MLE optimization + LASSO regularization as a MAP optimization with a Laplace prior distribution.
LASSO visualized
LASSO6 encourages sparsity in the model's parameters: it forces close-to-zero parameters to exactly zero, so the model relies only on vital features7.
6 Least Absolute Shrinkage and Selection Operator
7 This is why it is called “selection operator”: it performs feature selection.
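One way to see this sparsifying effect concretely is soft-thresholding, the closed-form solution of the \(L_1\)-penalized problem for a single weight (a minimal sketch; the function name `soft_threshold` is my own):

```python
def soft_threshold(w, lam):
    """Proximal operator of lam * |w|: shifts w toward zero by lam,
    and sets any weight with |w| <= lam exactly to zero."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

print(soft_threshold(1.0, 0.25))  # 0.75: large weights are only shrunk
print(soft_threshold(0.1, 0.25))  # 0.0: small weights are zeroed out
```

This exact zeroing is what an \(L_2\) penalty never does: Ridge rescales weights toward zero but does not snap them to it.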
If your model has two parameters \(w_1\) and \(w_2\), LASSO penalization would look like this:
Ridge regularization shrinks all weights toward zero uniformly, but rarely makes any of them exactly zero. Most modern optimizers implement a variant of Ridge regularization called “weight decay”. For instance, in PyTorch we would add Ridge regularization as follows:
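A minimal sketch; the model and hyperparameter values are placeholders:

```python
import torch

# Placeholder model for illustration
model = torch.nn.Linear(10, 1)

# weight_decay applies an L2 (Ridge) penalty on the parameters:
# each update also shrinks every weight by lr * weight_decay * w
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```

Note that for adaptive optimizers the details differ: `torch.optim.AdamW` decouples the decay from the gradient-based update, which is closer to a literal “weight decay” than to adding \(\lambda \left\| W \right\|_2^2\) to the loss.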