I’ve briefly written about ways to combat overfitting in the post here, when I wrote about regularization. In that post, I talked a bit about L1 and L2 regularization and briefly touched on the difference between them. In this post, I’m going to do a deep dive into the differences. This article was an excellent resource in helping me understand it, so if you want an even better explanation, go here!
When we use regularization, we’re adding a penalty to the loss function to reduce the amount of noise the model would learn. The penalty you add can either be an L1 (or Lasso) penalty, which is the sum of the absolute values of the coefficients, or an L2 (or Ridge) penalty, which is the sum of the squares of the coefficients.
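To make the two penalties concrete, here is a minimal sketch in numpy computing both for an example coefficient vector (the values of `b` are just illustrative):

```python
import numpy as np

b = np.array([0.5, -2.0, 0.0, 3.0])   # example coefficient vector

l1_penalty = np.sum(np.abs(b))        # L1 / Lasso: sum of absolute values
l2_penalty = np.sum(b ** 2)           # L2 / Ridge: sum of squares

print(l1_penalty)   # 5.5
print(l2_penalty)   # 13.25
```

Note how the L2 penalty punishes the large coefficient (3.0 contributes 9.0) far more heavily than the L1 penalty does.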
The motivation for adding the penalties to the loss function is to constrain the growth of the coefficients, which may otherwise be skewed by outlier data. If we do not add regularization, the coefficients for the features may grow in an uncontrolled manner to find the “perfect” fit for the training data, which is the hallmark of overfitting.
In L1 Regularization, the penalty we’re adding is the absolute value of the coefficients.
In the image above, we use the Residual Sum of Squares (RSS) as the loss function to train our model weights. The regularization term we add at the end is the sum of the absolute values of the coefficients, multiplied by lambda, which controls the strength of the L1 penalty.
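A minimal sketch of that L1-penalized loss in code (illustrative only, not the exact implementation behind the image):

```python
import numpy as np

def lasso_loss(X, y, b, lam):
    """RSS plus the L1 penalty: sum of squared residuals + lam * sum(|b_j|)."""
    residuals = y - X @ b
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(b))

# Tiny worked example: residuals are [1, 2], so RSS = 5; |b| sums to 1.
X = np.array([[1.0], [2.0]])
y = np.array([2.0, 4.0])
print(lasso_loss(X, y, np.array([1.0]), lam=1.0))  # 5 + 1*1 = 6.0
```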
When we add this L1 regularization term, we’re constraining the coefficients to a diamond-shaped region. In the example below, we have 2 coefficients we want to optimize, b1 and b2. When we use L1 regularization, the search space for the values of b1 and b2 becomes bounded by the diamond. For a loss function that produces the blue/red contour lines, we want to find the point in the diamond that is closest to the minimum point.
We see that in the constrained search space for b1 and b2, the point closest to the minimum of the loss function tends to be a corner of the diamond, where either b1 or b2 is zero. Because of this, L1 regularization tends to produce a set of coefficients that are small or exactly zero, thereby producing a sparse set of coefficients.
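We can see this sparsity concretely with scikit-learn. In this sketch (the data and the alpha values are illustrative choices), only the first two of five features actually drive the target, and the Lasso zeroes out the rest while Ridge merely shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # the noise features
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # shrunk, not zeroed
```

The Lasso drives the three irrelevant coefficients to exactly zero; the Ridge coefficients for those features are tiny but remain non-zero.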
L2 regularization, on the other hand, adds a penalty that is the square of the coefficients.
L2 constrains the search space of b1 and b2 in the same way that L1 does; however, the constraint is now on the squares of the coefficients. This produces a search space for b1 and b2 that is an ellipse instead of a diamond. The growth of b1 and b2 is thus constrained to be inside the ellipse.
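The L2-penalized loss can be sketched the same way as the L1 case (again illustrative, not the exact code behind the image):

```python
import numpy as np

def ridge_loss(X, y, b, lam):
    """RSS plus the L2 penalty: sum of squared residuals + lam * sum(b_j^2)."""
    residuals = y - X @ b
    return np.sum(residuals ** 2) + lam * np.sum(b ** 2)

# Tiny worked example: b = [2] fits this data exactly, so RSS = 0,
# and the whole loss comes from the squared penalty 2^2 = 4.
X = np.array([[1.0], [2.0]])
y = np.array([2.0, 4.0])
print(ridge_loss(X, y, np.array([2.0]), lam=1.0))  # 0 + 1*4 = 4.0
```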
In trying to find the point on the ellipse that is closest to the minimum point of the loss function, we find that the values of b1 and b2 are usually both non-zero, lying on the edge of the ellipse. Thus, one of the main differences between L1 and L2 regularization is that L1 produces a sparse set of coefficient values, while L2 does not.
Why a Diamond vs Ellipse Search Space?
To understand why L1 produces a diamond-shaped search space while L2 produces an ellipse, we only need to look at the mathematical form of the constraints.
In the image above, we have the loss function (or cost function) depicted in the red lines, and the minimum of the loss function is the point in the center.
When we use L1, or Lasso regression, we create a search space that is bounded by the absolute values of the coefficients, and that produces a diamond shape. And when we use L2, or Ridge regression, we create a search space that is bounded by the squared values of the coefficients, thus producing an ellipse shape.
Because the closest point of the Lasso search space to the minimum of the loss function is likely to be at one of the diamond’s corners, some coefficients will end up being exactly zero. The closest point in Ridge regression to the minimum of the loss function lies somewhere on the smooth edge of the ellipse, so the coefficients are shrunk but remain non-zero.
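The two constraint regions can be sketched directly as membership tests (an illustrative 2D check, with `t` standing in for the constraint budget):

```python
import numpy as np

def in_l1_region(b, t):
    """Diamond: b satisfies |b1| + |b2| + ... <= t."""
    return np.sum(np.abs(b)) <= t

def in_l2_region(b, t):
    """Ellipse: b satisfies b1^2 + b2^2 + ... <= t."""
    return np.sum(b ** 2) <= t

# A corner of the L1 diamond sits on an axis, where one coefficient is 0.
corner = np.array([1.0, 0.0])
print(in_l1_region(corner, t=1.0))   # True: exactly on the diamond's boundary

# A diagonal point of the same total magnitude falls outside the diamond,
# which is why the two shapes favor different solutions.
diag = np.array([0.8, 0.8])
print(in_l1_region(diag, t=1.0))     # False: 0.8 + 0.8 = 1.6 > 1
print(in_l2_region(diag, t=1.3))     # True: 0.64 + 0.64 = 1.28 <= 1.3
```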
L1 for Feature Selection
Because the coefficient values for L1 are sparse and mostly zero, we can use it for feature selection. The features that have their coefficients assigned as zero are non-discriminatory, and therefore do little to affect the prediction.
When defining a model, we specify an L1 penalty for regularization. After we have trained the model, the Lasso gives us the coefficient value for each feature. Using that, we can remove the features the Lasso has assigned a zero or near-zero value, and retrain the model to evaluate whether doing so has improved its performance.
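A sketch of that workflow with scikit-learn (the synthetic dataset, the alpha value, and the zero-threshold here are all illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, but only 4 actually influence the target.
X, y = make_regression(n_samples=300, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)

# Keep only the features the Lasso did not zero out, then retrain on those.
keep = np.abs(lasso.coef_) > 1e-8
X_reduced = X[:, keep]
print(f"kept {keep.sum()} of {X.shape[1]} features")
```

A model retrained on `X_reduced` can then be compared against the original to see if dropping the zeroed features hurt or helped.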
Because L2 does not shrink the coefficients of the features to zero, we cannot use it to filter features in the same way.
To sum up, regularization is really just constraining the values of the coefficients by defining a search space. When we use L1 regularization, we define a diamond-shaped search space, for which the closest point to the minimum is often at a corner, producing zero-valued coefficients. When we use L2 regularization, we define an ellipse-shaped search space, where the closest point to the minimum is non-zero. We do this constraining so the coefficients do not grow too much and overfit on the given data.
We can also use the sparse output of L1 regularization for feature selection, removing the features whose coefficients the Lasso has set to zero.