Ridge Regression

Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where independent variables are highly correlated.

Its main effects, coefficient shrinkage and reduced variance, are discussed below; the scikit-learn ridge guide also provides a useful overview of ridge regression.

Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients. The ridge coefficients minimize a penalized residual sum of squares:

\[\min_{\theta} \{\| X \theta - y\|_2^2 + \alpha \|\theta\|_2^2\}\]

where \(\alpha \geq 0\) is a real-valued regularization strength.
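
As a quick illustration, the sketch below fits this objective with scikit-learn's `Ridge` estimator; the synthetic data and the choice `alpha=1.0` are arbitrary assumptions made for the example:

```python
# A minimal sketch: fit the penalized objective with scikit-learn's Ridge.
# The data is synthetic and alpha=1.0 is an arbitrary choice.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_theta = np.array([1.0, 2.0, 0.0, -1.0, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=100)

# alpha is the regularization strength in the objective above.
model = Ridge(alpha=1.0, fit_intercept=False)
model.fit(X, y)
print(model.coef_)
```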

Taking the partial derivatives of the cost function and setting them to zero gives the closed-form solution:

\[\hat \theta = (X^TX + \alpha I)^{-1} X^T y\]
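
This closed form is straightforward to verify numerically. The sketch below is a plain NumPy implementation that solves the regularized normal equations directly rather than forming an explicit inverse:

```python
# A direct NumPy sketch of the closed-form solution above:
# theta = (X^T X + alpha I)^{-1} X^T y.
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve the regularized normal equations (no explicit matrix inverse)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)
```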

If we decompose \(X\) with the SVD \(X = U\Sigma V^T\), we have \(X^TX + \alpha I = V ({\color{red}{\Sigma^2 + \alpha I}}) V^T\), which is invertible whenever \(\alpha > 0\), since every diagonal entry \(\sigma_i^2 + \alpha\) is strictly positive.

Substituting this into the solution above gives:

\[\hat \theta = V (\Sigma^2 + \alpha I)^{-1} \Sigma U^T y\]

We can write it in another form:

\[\hat \theta = \sum_{i=1}^n \frac{\sigma_i}{\sigma_i^2 + \alpha} u_i^Tyv_i = \sum_{i=1}^n {\color{red} {\frac{\sigma_i^2}{\sigma_i^2 + \alpha}}} \frac{1}{\sigma_i}u_i^Tyv_i\]
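
As a sketch under the same assumptions (synthetic data, an arbitrary \(\alpha\)), the SVD form can be implemented in a few lines and checked against the normal-equations solution:

```python
import numpy as np

def ridge_svd(X, y, alpha):
    # Thin SVD: X = U diag(s) V^T.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # theta = sum_i  s_i / (s_i^2 + alpha) * (u_i^T y) * v_i
    return Vt.T @ ((s / (s**2 + alpha)) * (U.T @ y))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
alpha = 0.5

# The SVD route agrees with solving the regularized normal equations.
theta_ne = np.linalg.solve(X.T @ X + alpha * np.eye(4), X.T @ y)
print(np.allclose(ridge_svd(X, y, alpha), theta_ne))  # True
```

The factor highlighted in red, \(\sigma_i^2 / (\sigma_i^2 + \alpha)\), acts as a per-direction filter: directions with large singular values are barely shrunk, while directions with small singular values are damped heavily.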

As we can see, the complexity parameter \(\alpha \geq 0\) controls the amount of shrinkage: the larger the value of \(\alpha\), the greater the amount of shrinkage, and the more robust the coefficients become to collinearity. Through this shrinkage, ridge regression reduces the variance of the model parameters. To summarize: