The R-squared values both increased significantly, and the probability of the F-statistic dropped.
This is a testament to the utility of feature engineering and compelling evidence that there is a substantial relationship between climate, topography, and happiness.
However, there is still a high condition number, which means that the multicollinearity problem between my features has not been solved.
It actually got worse, which makes sense since my newly engineered features would not be independent from the raw features they are composed of.
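To illustrate why engineered features push the condition number up, here is a minimal sketch on synthetic data. The feature names and values are made up for illustration and are not my actual dataset, and the condition number here is computed on standardized columns, so the numbers will not match a regression summary table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two raw features (hypothetical values, 50 "states").
humidity = rng.normal(60.0, 10.0, size=50)
rainfall = rng.normal(40.0, 8.0, size=50)


def standardize(col):
    """Center a column and scale it to unit variance."""
    return (col - col.mean()) / col.std()


# Design matrix with only the raw features.
raw = np.column_stack([standardize(humidity), standardize(rainfall)])

# Same matrix plus an interaction term built from those raw features.
interaction = standardize(humidity * rainfall)
engineered = np.column_stack([raw, interaction])

cond_raw = np.linalg.cond(raw)
cond_engineered = np.linalg.cond(engineered)

print(f"condition number, raw features:     {cond_raw:.1f}")
print(f"condition number, with interaction: {cond_engineered:.1f}")
```

Because the interaction column is largely a combination of the columns it was built from, appending it can only keep the condition number the same or raise it.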
At this point, I decided to take a step back and look at the three features that were the most predictive of happiness.
Those features were mean elevation (transformed), air quality, and percent area water (transformed).
The interaction terms didn’t really seem to help!

Keeping these terms in mind, I decided to implement regularization to further refine my model.

Regularization

When creating predictive models, the practice of regularization essentially boils down to penalizing the complexity of a model in an effort to enhance its ability to generalize to unseen data.
An overly complex model will predict the target variable (happiness, in my case) from in-sample data well, but will perform poorly when presented with out-of-sample data.
There are two main methods of regularization for regression problems.
These are known as Lasso and Ridge regression.
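Lasso helps by eliminating unneeded features (it can shrink their coefficients to exactly zero), while Ridge shrinks all coefficients toward zero without removing any, which stabilizes the estimates when features are collinear (here is a useful article on the important differences between the two methods).
Since I was interested in both of these effects, I tried both methods.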
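The difference between the two penalties is easy to see on a toy problem. The sketch below uses scikit-learn on synthetic data where only the first two of four features matter. The data, the `alpha` values, and the variable names are assumptions chosen for illustration, not the settings I used for the happiness model.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 50

# Four synthetic features; only the first two actually drive the target.
X = rng.normal(size=(n_samples, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n_samples)

# Both penalties depend on feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# alpha controls penalty strength; these values are arbitrary examples.
lasso = Lasso(alpha=0.5).fit(X_scaled, y)
ridge = Ridge(alpha=10.0).fit(X_scaled, y)

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```

With a penalty this strong, Lasso sets the two noise coefficients to exactly zero, while Ridge keeps all four coefficients nonzero but smaller.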
I first created Lasso and Ridge models for the three most important features I mentioned above (mean elevation transformed, air quality, and percent area water transformed) along with an additional feature: the humidity-rainfall interaction term.
Here is a visualization comparing the performance of the two models for the above set of features:

The axes of this graph are true happiness against predicted happiness, meaning that predictions from a perfect model would fall exactly on the green line.
Upon inspection, it looks like Lasso predictions are consistently better towards the edges of the distribution, but Ridge predictions are better in the middle.
The vertical distance between a given prediction point and the perfect green line is the error associated with that prediction.
RMSE (root mean squared error) is the square root of the average squared error across all predictions a model makes (in the graph above, it is roughly the typical distance between a prediction and the perfect line, with large misses weighted more heavily).
RMSE is a good metric for determining a regression model’s usefulness.
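For concreteness, here is a small sketch of how RMSE can be computed with NumPy, next to the mean absolute error (MAE) for comparison. The scores below are made-up numbers, not values from my dataset, and the helper names `rmse` and `mae` are my own.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of squared errors."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: the plain average distance from the target."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(errors)))


# Hypothetical happiness scores: three small misses and one large one.
actual = [50.0, 55.0, 60.0, 65.0]
predicted = [51.0, 54.0, 61.0, 59.0]

print(f"RMSE: {rmse(actual, predicted):.2f}")
print(f"MAE:  {mae(actual, predicted):.2f}")
```

Because the errors are squared before averaging, the single large miss pulls RMSE above MAE.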
The RMSEs for the Lasso and Ridge models visualized above are 4.77 and 4.25, respectively.
This means that for a given set of features, the Lasso prediction of happiness will typically be about 4.77 points off target, while the Ridge prediction will typically be about 4.25 points off.
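Unfortunately, the standard deviation in the happiness scores is around 3.
An RMSE larger than the standard deviation means the model does worse than simply predicting the average happiness score for every state, so these predictions are quite off.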
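The standard deviation is a useful yardstick because it equals the RMSE of a naive model that predicts the mean score for everyone. The sketch below checks this on made-up scores (not my actual happiness data).

```python
import numpy as np

# Hypothetical happiness scores for a handful of states.
scores = np.array([48.0, 52.0, 55.0, 57.0, 60.0, 63.0])

# Naive baseline: predict the mean score for every state.
baseline_pred = np.full_like(scores, scores.mean())
baseline_rmse = float(np.sqrt(np.mean((scores - baseline_pred) ** 2)))

# Population standard deviation (ddof=0, NumPy's default).
std = float(np.std(scores))

print(f"baseline RMSE: {baseline_rmse:.3f}")
print(f"std of scores: {std:.3f}")
```

A model is only adding value if its RMSE comes in below this baseline.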
In the hopes of finding a better model, I tried excluding/including various features.
I found that using the three most important features mentioned above and replacing the humidity-rainfall feature with temperature was a combination that produced significantly improved models.
Here is the same visualization as above except for models with the new set of features:

The RMSEs for the Lasso and Ridge models corresponding to the new set of features were 3.33 and 3.

Results

Using the Lasso model with the latest set of features, I predicted the happiness of all 50 states using their respective climate and topography features and ranked them.
Below you can see my model’s prediction for the top 15 happiest states compared to the actual top 15 happiest states:

The color red indicates that my model’s predicted ranking for the state was within 5 spots of its actual ranking.
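The ranking comparison can be sketched in a few lines of plain Python. The state labels and scores below are placeholders, and the helper names `rank_states` and `within_tolerance` are my own for this illustration. The example uses a tolerance of 1 because it only has six entries, whereas the table above uses 5.

```python
def rank_states(scores):
    """Map each state to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {state: position + 1 for position, state in enumerate(ordered)}


def within_tolerance(actual, predicted, tolerance):
    """Return states whose predicted rank is within `tolerance` spots of the actual rank."""
    actual_rank = rank_states(actual)
    predicted_rank = rank_states(predicted)
    return {
        state
        for state in actual
        if abs(actual_rank[state] - predicted_rank[state]) <= tolerance
    }


# Placeholder scores for six hypothetical states.
actual = {"A": 70.0, "B": 68.0, "C": 66.0, "D": 64.0, "E": 62.0, "F": 60.0}
predicted = {"A": 61.0, "B": 69.0, "C": 67.0, "D": 63.0, "E": 65.0, "F": 60.0}

close = within_tolerance(actual, predicted, tolerance=1)
print(sorted(close))
```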

Conclusions

My model made some great predictions and some terrible ones as well.
What does this mean?

Perhaps I could have distilled a better model.
Maybe there were interaction terms I did not try that would have been particularly useful.
Maybe the series of features I used was not the best for predicting happiness.
Maybe I could’ve optimized my Lasso and Ridge regressions with a wider hyperparameter search (here is an interesting source on this topic).
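As a rough idea of what a wider search could look like, here is a sketch that uses scikit-learn's `LassoCV` and `RidgeCV` to pick the penalty strength by cross-validation. The synthetic data, the `alphas` grid, and the 5-fold setting are all assumptions for illustration, not the search I actually ran.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a 50-row dataset with four features.
X = rng.normal(size=(50, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=50)

# Candidate penalty strengths spanning several orders of magnitude.
alphas = np.logspace(-3, 2, 30)

# Each pipeline standardizes the features, then cross-validates over alphas.
lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5)).fit(X, y)
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5)).fit(X, y)

# The final pipeline step holds the selected penalty strength.
best_lasso_alpha = lasso_cv[-1].alpha_
best_ridge_alpha = ridge_cv[-1].alpha_

print(f"best Lasso alpha: {best_lasso_alpha:.4f}")
print(f"best Ridge alpha: {best_ridge_alpha:.4f}")
```

With only 50 rows, the selected value can swing noticeably between folds, so it is worth checking whether the chosen alpha sits at the edge of the grid before trusting it.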
More likely, however, the relationship between climate, topography, and happiness exists but is not that strong.
The presence of predictive capability in my model does demonstrate that there is a relationship between the features I analyzed and happiness, but it does leave much variability in happiness unexplained.
To have a more robust happiness predictor, we likely would have to look at features outside of climate and topography.
My hunch is that median household income, percentage of people below the poverty line, and wealth disparity would be strong predictors for overall happiness.
Furthermore, in creating a more robust model, it could be helpful to look at countries as opposed to American states.
Performing linear regression on only 50 data points (50 states) is ill-advised.
Predictive models tend to perform better when trained on more data.
All things considered, this model should not be used as a happiness predictor.
But it does suggest that there is some relationship between climate, topography, and happiness, and that climate and topography features should be included in a more expansive set of features in order to predict happiness.
In the future, I will incorporate monetary and demographic features (perhaps using this Kaggle dataset), and refine my happiness metric by looking for more studies.
I will also take a deeper dive into the rich landscape of feature engineering, as it provided the largest performance boost I encountered during this process.
Thanks for reading my post and I hope you enjoyed it! Feel free to send any questions or comments my way.
See you next time!