2 \(\lambda\) controls the strength of the regularization: high \(\lambda \implies\) high bias and low variance (and vice versa for low \(\lambda\)).
LASSO adds an \(L_1\) norm penalty to the loss function3:
3 Remember \(\left\| W \right\|_1 = \sum_{i=1}^n |w_i|\)
\[
\mathcal{L}_{\text{LASSO}} = \mathcal{L} + \lambda \left\| W \right\|_1
\]
Ridge adds an \(L_2\) norm penalty to the loss function4:
4 Remember \(\left\| W \right\|_2^2 = \sum_{i=1}^n w_i^2\)
\[
\mathcal{L}_{\text{Ridge}} = \mathcal{L} + \lambda \left\| W \right\|_2^2
\]
Where \(\lambda\) is the regularization parameter.
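As a quick illustrative sketch (the helper names `l1_penalty` and `l2_penalty` are my own, not from any library), both penalties can be computed directly from the weight vector:

```python
def l1_penalty(weights, lam):
    """LASSO penalty: lambda times the sum of absolute values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge penalty: lambda times the sum of squared values."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -2.0, 1.5]
print(l1_penalty(weights, 0.5))  # 0.5 * (0.5 + 2.0 + 1.5) = 2.0
print(l2_penalty(weights, 0.5))  # 0.5 * (0.25 + 4.0 + 2.25) = 3.25
```

During training, the chosen penalty is simply added to the data loss before computing gradients.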
MLE vs MAP
It is instructive to link these regularization techniques to the frequentist vs Bayesian perspectives on model optimization.
We usually think of model optimization from a frequentist perspective: We apply Maximum Likelihood Estimation (MLE) to find the parameters that best describe the data:
\[
\max_\theta p (\mathcal{D} \mid \theta)
\]
However, we could also take a Bayesian perspective and apply Maximum a Posteriori (MAP) estimation, in which case we optimize the posterior5:
5 I remove the denominator (aka evidence) as it is independent of the parameters.
\[
\max_\theta p (\theta \mid \mathcal{D}) \propto \max_\theta p (\mathcal{D} \mid \theta) \cdot p (\theta)
\]
Notice that now we have to assume some prior distribution over the parameters \(p(\theta)\). Interestingly, this prior distribution can be seen as a regularization term in the loss function.
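To make this concrete, take the negative log of the posterior (which preserves the location of the optimum): the likelihood term becomes the usual loss, and the prior term becomes the regularizer:

\[
-\log p(\theta \mid \mathcal{D}) = \underbrace{-\log p(\mathcal{D} \mid \theta)}_{\mathcal{L}} \; \underbrace{- \log p(\theta)}_{\text{regularizer}} + \text{const}
\]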
For instance, if we assume a Gaussian prior distribution over the parameters \(p(\theta) = \mathcal{N}(\theta \mid 0, \sigma^2)\), we get the following loss function: \[
\mathcal{L}_{\text{MAP}} = \mathcal{L} + \lambda \left\| W \right\|_2^2 + \text{const}
\]
Where \(\lambda = \frac{1}{2\sigma^2}\) and \(\text{const}\) is a constant term independent of the parameters.
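To see where \(\lambda = \frac{1}{2\sigma^2}\) comes from, take the negative log of the Gaussian prior (assuming independent parameters):

\[
-\log p(\theta) = -\sum_{i=1}^n \log \mathcal{N}(w_i \mid 0, \sigma^2) = \frac{1}{2\sigma^2} \sum_{i=1}^n w_i^2 + \text{const} = \frac{1}{2\sigma^2} \left\| W \right\|_2^2 + \text{const}
\]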
This means that we can see the MLE optimization + Ridge regularization as a MAP optimization with a Gaussian prior distribution.
Similarly, if we assume a Laplace prior distribution over the parameters \(p(\theta) = \text{Laplace}(\theta \mid 0, b)\) with scale \(b\), we get the following loss function: \[
\mathcal{L}_{\text{MAP}} = \mathcal{L} + \lambda \left\| W \right\|_1 + \text{const}
\]
Where \(\lambda = \frac{1}{b}\) and \(\text{const}\) is a constant term independent of the parameters.
This means that we can see the MLE optimization + LASSO regularization as a MAP optimization with a Laplace prior distribution.
LASSO visualized
LASSO6 encourages sparsity in the model's parameters: it forces close-to-zero parameters to exactly zero, so the model relies only on vital features7.
6 Least Absolute Shrinkage and Selection Operator
7 This is why it is called “selection operator”: it performs feature selection.
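One way to see this sparsifying effect concretely is soft-thresholding, the closed-form solution of the \(L_1\)-penalized problem for a single weight (a minimal sketch; the function name `soft_threshold` is my own):

```python
def soft_threshold(w, lam):
    """Proximal operator of lam * |w|: shifts w toward zero by lam,
    and sets any weight with |w| <= lam exactly to zero."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

print(soft_threshold(1.0, 0.25))  # 0.75: large weights are only shrunk
print(soft_threshold(0.1, 0.25))  # 0.0: small weights are zeroed out
```

This exact zeroing is what an \(L_2\) penalty never does: Ridge rescales weights toward zero but does not snap them to it.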
If your model has two parameters \(w_1\) and \(w_2\), LASSO penalization would look like this:
Ridge regularization shrinks all weights toward zero uniformly, but rarely makes any of them exactly zero. Most modern optimizers implement a variant of Ridge regularization called “weight decay”. For instance, in PyTorch we would add Ridge regularization as follows:
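A minimal sketch; the model and hyperparameter values are placeholders:

```python
import torch

# Placeholder model for illustration
model = torch.nn.Linear(10, 1)

# weight_decay applies an L2 (Ridge) penalty on the parameters:
# each update also shrinks every weight by lr * weight_decay * w
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```

Note that for adaptive optimizers the details differ: `torch.optim.AdamW` decouples the decay from the gradient-based update, which is closer to a literal “weight decay” than to adding \(\lambda \left\| W \right\|_2^2\) to the loss.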