Bias and variance in linear models
A look at the bias and variance tradeoff for linear models
Nischal M
Jun 24
I am sure everyone has seen this diagram in the past:
Figure: First figure in Scott Fortmann-Roe’s Understanding the Bias-Variance Tradeoff.
Understanding the Bias-Variance Tradeoff goes into great detail about the tradeoff and the errors, and I highly recommend it.
The above figure is the first figure in that post and shows the predictions of multiple models with different bias and variance errors.
The bullseye is the true value we want to predict and the blue dots are what the model actually predicts.
In this post I want to show visually how the bias and variance tradeoff takes shape in linear models.
Why linear models?
Because they are well understood and give a very easy way of controlling these errors: regularization.
Ordinary Least Squares (OLS) regression is known to give unbiased estimates when the linear model is correctly specified, usually with low variance compared to more flexible non-linear models.
Ridge (OLS with an L2 penalty) and Lasso (OLS with an L1 penalty) give biased results, in exchange for a lower variance than OLS.
The degree of penalization is controlled by the regularization coefficient, λ, which in turn controls the two errors, as we will see below.
Lasso is a special case here because it aggressively pushes coefficient estimates to exactly zero, but it helps to keep things in perspective.
You can read more about regularization here.
The procedure
I will stick to the same method used to describe the errors conceptually in Scott’s post.
I pick all fixed numbers arbitrarily here.
1. Simulate 500 data points from y = α + βx + ϵ, where ϵ ~ N(0, 8), x ~ U(-2, 2), α = 2 and β = 3.
2. Do step one 1000 times and gather all datasets.
3. For each set, fit OLS, Ridge, and Lasso models with a fixed λ to predict y for x = 3.
4. Expected prediction should be 2 + 3 × 3 = 11.
Now we have 3000 (1000 OLS + 1000 Ridge + 1000 Lasso) predictions we can look at to see the true “nature” of these models.
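The procedure above can be sketched in Python roughly as follows. This is not the code from the GitHub repository for this post. It assumes that the 8 in N(0, 8) is a standard deviation, and it uses closed-form single-predictor estimates under an assumed penalty scaling (squared error averaged over the n points, intercept unpenalized), so its λ values need not line up with the ones used for the plots. The names `simulate`, `predict_at`, and `run` are made up for illustration.

```python
import numpy as np

ALPHA, BETA = 2.0, 3.0            # true intercept and slope from the post
N_POINTS, N_DATASETS = 500, 1000
NOISE_SD = 8.0                    # assumption: the 8 in N(0, 8) is a standard deviation
X_NEW = 3.0                       # point at which every model predicts


def simulate(rng):
    """Draw one dataset from y = ALPHA + BETA * x + noise."""
    x = rng.uniform(-2.0, 2.0, size=N_POINTS)
    y = ALPHA + BETA * x + rng.normal(0.0, NOISE_SD, size=N_POINTS)
    return x, y


def predict_at(x, y, lam, x_new=X_NEW):
    """Return the OLS, ridge and lasso predictions at x_new for one dataset.

    Assumed objectives (a hypothetical scaling, not necessarily the post's):
      ridge: mean((y - a - b*x)**2) / 2 + lam * b**2 / 2
      lasso: mean((y - a - b*x)**2) / 2 + lam * abs(b)
    """
    x_bar, y_bar = x.mean(), y.mean()
    s_xx = np.mean((x - x_bar) ** 2)
    s_xy = np.mean((x - x_bar) * (y - y_bar))
    slopes = {
        "ols": s_xy / s_xx,
        "ridge": s_xy / (s_xx + lam),
        # Soft-thresholding: the slope is exactly zero once lam >= |s_xy|.
        "lasso": np.sign(s_xy) * max(abs(s_xy) - lam, 0.0) / s_xx,
    }
    # The intercept is unpenalized, so it equals y_bar - slope * x_bar in every case.
    return {name: y_bar + b * (x_new - x_bar) for name, b in slopes.items()}


def run(lam, seed=0):
    """Repeat simulate-and-predict N_DATASETS times for a fixed lam."""
    rng = np.random.default_rng(seed)
    preds = {"ols": [], "ridge": [], "lasso": []}
    for _ in range(N_DATASETS):
        x, y = simulate(rng)
        for name, value in predict_at(x, y, lam).items():
            preds[name].append(value)
    return {name: np.array(values) for name, values in preds.items()}


predictions = run(lam=0.5)
for name, values in predictions.items():
    print(f"{name:5s} mean={values.mean():6.2f} sd={values.std():5.2f}")
```

Calling `run` with other values of `lam`, including 0, shows how the three sampling distributions move relative to each other. Libraries such as scikit-learn scale their Ridge and Lasso penalties differently, so the same numeric λ does not mean the same amount of shrinkage across implementations.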
You can find all the code for this on my GitHub page here.
A note on how to read the plots.
I want you to pay attention to two things:
1. The distance between the true value (shown as a black dashed line) and the average predicted value for each model (shown as a dashed line in that model's color). This distance is the bias of the model (squaring it gives the bias-squared error term). A large shift from the true value (11) is a large bias.
2. The width of each histogram reflects the variance of the model. A larger width means a larger variance.
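In terms of the predictions collected for one model, these two quantities could be computed along the following lines. This is a minimal sketch using only the standard library. The helper name `bias_and_variance` and the four toy numbers are made up for illustration, and the sign of the bias (true value minus average prediction) follows the convention stated later in this post.

```python
from statistics import fmean, pvariance


def bias_and_variance(predictions, true_value):
    """Summarize one model's predictions across the simulated datasets."""
    mean_prediction = fmean(predictions)
    # Bias: how far the average prediction sits from the true value.
    bias = true_value - mean_prediction
    # Variance: average squared deviation from the average prediction.
    variance = pvariance(predictions, mu=mean_prediction)
    return bias, variance


# Toy numbers for illustration only, not results from the post.
bias, variance = bias_and_variance([10.0, 12.0, 11.0, 9.0], true_value=11.0)
print(bias, variance)
```

The bias corresponds to the gap between the two dashed lines in each plot, and the square root of the variance is a measure of the histogram's width.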
λ ~ 0
Starting off with a very tiny λ value.
This is practically equivalent to having no penalty, so we would expect Ridge and Lasso to give the same results as OLS.
The plot gives no surprises.
All three distributions overlap with means around the true value.
Notice how spread out the distribution is though.
There is a large variance in the predictions, which range from 9 to 13.
λ = 0.01
With a (very) small penalty it is easy to see regularization in effect.
The distributions have shifted to the left (evident from the means).
A small bias is observed in Ridge, and a relatively larger one in Lasso.
It is not clear if the variance has changed.
λ = 0.05
At λ = 0.05, Lasso is already too aggressive, with a bias of 3 units.
Ridge stays close to the true value, but its variance looks about the same as before.
So Ridge offers no advantage yet for this data.
λ = 0.1
Almost the same result as above.
It is hard to notice any change in variance yet.
λ = 0.5
A higher penalty gives some (reasonably) satisfactory clues.
The bias of Ridge has increased to close to three units, but its variance is smaller.
Lasso has very aggressively pushed the coefficient estimate for β to zero, resulting in a very high bias but a small variance.
λ = 1: Some good results!
Here the tradeoff has clearly switched sides.
The variance of Ridge is small at the cost of higher bias.
λ = 5
Just to really drive the point home, here is a very large penalty.
Variance on Ridge is small at the cost of much higher bias.
You will probably never need such large penalties.
But the facts are clear: a lower variance comes at the cost of a higher bias.
Bias and variance for various regularization values
Repeating the above for a range of regularization values gives a clear picture.
Bias is computed as the distance between the average prediction and the true value: true value minus mean(predictions).
Variance is the average squared deviation from the average prediction: mean((prediction minus mean(predictions))²).
The plots give the same observation.
OLS has the lowest bias but highest variance.
Ridge looks like a smooth shift and Lasso is constant after around λ = 0.2 (β becomes 0, thus predicting y = α for all values of x).
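The different shapes of the two curves can be read off the closed-form slope estimates for a single predictor. The formulas below are a sketch under an assumed parametrization (squared error averaged over the n points, unpenalized intercept), so the λ scale may differ from the one used for the plots in this post.

```latex
\hat{\beta}_{\text{ridge}} = \frac{s_{xy}}{s_{xx} + \lambda},
\qquad
\hat{\beta}_{\text{lasso}} = \frac{\operatorname{sign}(s_{xy})\,\max\bigl(|s_{xy}| - \lambda,\ 0\bigr)}{s_{xx}},

\text{where}\quad
s_{xy} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}),
\qquad
s_{xx} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2}.
```

The Ridge slope shrinks smoothly as λ grows but never reaches exactly zero. The Lasso slope is exactly zero as soon as λ ≥ |s_xy|, after which the prediction is the sample mean of y for every x. Since x is centered at zero here, that mean is α on average, and increasing λ further changes nothing.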
Ideal distributions
A better choice of data could give us an ideal plot of the sampling distributions of predictions.
The advantage Ridge offers is immediately evident here because of the overlapping distributions.
Ridge gives a slightly biased prediction, but will give a closer prediction much more often than OLS.
This is the true value of Ridge: a small bias, but more consistent predictions.
OLS gives an unbiased result but is not very consistent.
This is key: OLS gives an unbiased result on average, not always.
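One common way to make "closer" precise is the mean squared error of the prediction, which splits into the two quantities discussed throughout this post. Here ŷ is a model's prediction at x = 3, y* is the true value (11 in the simulation), and the expectation is taken over datasets. The notation is introduced here for illustration.

```latex
\mathbb{E}\bigl[(\hat{y} - y^{*})^{2}\bigr]
  = \underbrace{\bigl(\mathbb{E}[\hat{y}] - y^{*}\bigr)^{2}}_{\text{bias}^{2}}
  + \underbrace{\mathbb{E}\bigl[(\hat{y} - \mathbb{E}[\hat{y}])^{2}\bigr]}_{\text{variance}}

\text{Ridge has the smaller error when}\quad
\text{bias}_{\text{ridge}}^{2} + \text{variance}_{\text{ridge}} < \text{variance}_{\text{OLS}}.
```

The second line uses the fact that the bias of OLS is zero. A lower mean squared error is not exactly the same statement as being closer more often, but it captures the same idea: a small bias is worth paying for when it buys a large enough drop in variance.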
And that is the bias and variance tradeoff taking shape in linear models.
You can find all code I used for this post here.
I recommend you run it for different values of λ to see the changes for yourself.
Maybe even use it on a different dataset and see if you can see some overlap.
If you have any suggestions for this post please feel free to reach out and say hello.
I just want to take a moment to thank everyone that made this post possible.
Do take a moment to share and show your appreciation.
🙂 Thanks for reading!