Building a GradientBoostingRegressor to predict NBA player salaries

A brief guide to model validation/tuning and how to explain your model outputs

Steven Liu
Feb 4

Stephen Curry’s gross salary (ESPN)

Model tuning and validation are hugely important steps in the machine learning (ML) process.
Even though the model is doing most of the heavy lifting, there are things you can do to optimize its performance.
What’s also important is being able to explain the model’s output.
Many models today are complex, leaving us struggling to interpret the results.
But with the help of SHAP, I’ll show you how you can demystify your ML models.
We will use this NBA player dataset from Kaggle for our demonstration.

Train the GradientBoostingRegressor

Begin with a basic model and some random parameters to see how well it performs initially.

$958956 average error; 9.14% error

Scoring strategy

One of the first things we have to do is come up with a scoring strategy to evaluate prediction accuracy.
In this case, we use the mean absolute error (MAE) as our error metric because it measures the average absolute difference between the predicted and actual salary.
This allows us to see how far off the model is on average from the actual salary.
So in this case, the predicted salary is expected to be off by about $959,000 or approximately 9%.
While this may seem like a significant margin, keep in mind that many players are being paid tens of millions!

Some other common regression metrics are R² and mean squared error (MSE).
Whichever metric you end up picking will most likely depend on the problem you are working on, so select one that is most interpretable in the context of what you are trying to solve.
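
As a minimal sketch of the baseline and scoring steps, the snippet below fits a GradientBoostingRegressor and reports MAE, MSE and R². The data is a synthetic stand-in generated with make_regression, not the Kaggle dataset, and the percentage error is assumed here to be the MAE divided by the mean target, since the article does not say how its percentage was computed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the player data (assumption: not the Kaggle set).
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
y = y - y.min() + 1.0  # shift so every "salary" is positive

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A basic model with arbitrary starting parameters.
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)  # average absolute miss, in target units
mse = mean_squared_error(y_test, pred)   # penalizes large misses more heavily
r2 = r2_score(y_test, pred)              # share of variance explained
pct_error = 100 * mae / y_test.mean()    # assumed definition of "% error"

print(f"${mae:,.0f} average error; {pct_error:.2f}% error")
print(f"MSE: {mse:,.0f}; R²: {r2:.3f}")
```

MAE stays in the same units as the target, which is why it reads naturally as "dollars off on average".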

Hyperparameter tuning

Hyperparameters are settings that are chosen before training rather than learned from the data, and the general idea is that we fiddle around with these to improve prediction accuracy.
They specify the model architecture and have a profound effect on performance.
For the Gradient Boosting Regressor, some of the hyperparameters include learning rate, number of boosting stages to perform and the number of features to include when splitting (refer here for a complete list).

Grid Search

To select the optimal set of hyperparameters, one option is to perform a grid search where we specify a range of hyperparameters to test.
In this case, we try:

3 x 2 x 3 x 3 x 3 x 3 x 3 = 1,458 possible combinations of hyperparameters

Keep in mind that this can become a computationally expensive task if we try more hyperparameters or increase the range of values.
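
The count above can be checked with scikit-learn's ParameterGrid. The grid below is hypothetical: the keys are real GradientBoostingRegressor parameters, but the article's actual choice of parameters and values is not shown, so only the 3 x 2 x 3 x 3 x 3 x 3 x 3 shape is taken from the text.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid with the same shape as in the text (3 x 2 x 3 x 3 x 3 x 3 x 3).
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "loss": ["squared_error", "absolute_error"],
    "n_estimators": [100, 300, 500],
    "max_depth": [2, 3, 4],
    "max_features": [0.3, 0.6, 1.0],
    "min_samples_leaf": [1, 3, 5],
    "subsample": [0.6, 0.8, 1.0],
}

# ParameterGrid enumerates every combination without fitting anything.
n_combinations = len(ParameterGrid(param_grid))
print(n_combinations)  # 1458

# With 5-fold cross-validation, an exhaustive search fits each combination 5 times.
n_fits = n_combinations * 5
print(n_fits)  # 7290
```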
In general, there are two methods for performing a grid search.
Exhaustive grid search is exactly what it sounds like.
This will explore all possible candidates in the hyperparameter subspace and return the best combination.
Randomized grid search explores the same hyperparameter subspace, but performs a random search over the parameters so that not all combinations are trialed.
As it turns out, there are two benefits to this approach:

- increased efficiency, because we are not testing every single possible combination, without compromising performance
- utilizes a more efficient search strategy for discovering important hyperparameter values (read more about it here)

Comparison of grid and random search by Bergstra and Bengio

Cross-validation

One of the most common pitfalls is to overfit the model.
When we say overfit, we mean the model has simply memorized the training data instead of learning the underlying pattern.
If we were to evaluate this model on the test data, we would see significantly reduced performance because the model can’t memorize what it hasn’t seen before.
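
A small sketch can make the gap visible. The model below is deliberately too flexible (deep trees, many boosting stages) and is fit on noisy synthetic data; all values are illustrative assumptions rather than settings from the article.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Noisy synthetic data (assumption: stands in for the real player data).
X, y = make_regression(n_samples=300, n_features=10, noise=30.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Deliberately over-flexible model: deep trees and many boosting stages.
overfit_model = GradientBoostingRegressor(
    n_estimators=500, max_depth=8, learning_rate=0.2, random_state=1
)
overfit_model.fit(X_train, y_train)

# Error on rows the model has seen versus rows it has not.
train_mae = mean_absolute_error(y_train, overfit_model.predict(X_train))
test_mae = mean_absolute_error(y_test, overfit_model.predict(X_test))

print(f"train MAE: {train_mae:.2f}")
print(f"test MAE:  {test_mae:.2f}")
```

A training error far below the test error is the usual sign that the model has memorized noise in the training rows.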
Fortunately, we can employ a k-fold cross-validation (CV) procedure to detect overfitting and guard against it.

Example of a 5-fold cross-validation scheme

Essentially, we split the training data into k (5 in our case) equally sized folds.
The model is trained on 4 of the folds, tested on the remaining fold, and a performance measure is calculated.
This process is then iterated where each time, a different fold acts as the validation set.
In doing so, this reduces variability and allows us to obtain a better estimate of the model’s performance.
At the end, an average of the performance metric is calculated and then we can decide whether we are (1) happy with the results or if we should (2) go back and tweak the hyperparameters some more.
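
The 5-fold procedure described above can be sketched with KFold and cross_val_score. The data is synthetic and the model uses default parameters, both assumptions for illustration. scikit-learn reports MAE as a negative score so that higher is always better, hence the sign flip.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data (assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# 5 folds: each fold serves once as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = GradientBoostingRegressor(random_state=0)

scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
fold_mae = -scores  # flip the sign to get MAE per fold

print("MAE per fold:", fold_mae.round(2))
print("Mean MAE:", fold_mae.mean().round(2))
```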
While CV is considered a more robust validation strategy, note that it can also be computationally costly if you have a large dataset.
If your dataset is large enough, a single held-out validation set can suffice and you don’t need CV with multiple folds.
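
Randomized search and k-fold CV combine naturally in scikit-learn's RandomizedSearchCV. The sketch below samples 10 combinations from a small hypothetical grid, scores each with 5-fold CV on the training split, then checks the chosen model once on held-out rows. The data, the grid values and n_iter=10 are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the player data (assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical search space: 3 x 3 x 3 x 3 = 81 combinations in total.
param_distributions = {
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "subsample": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,  # try only 10 of the 81 combinations
    cv=5,       # 5-fold CV on the training split
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X_train, y_train)

best_cv_mae = -search.best_score_
# best_estimator_ is refit on the whole training split by default (refit=True).
test_mae = mean_absolute_error(y_test, search.best_estimator_.predict(X_test))

print("Best parameters:", search.best_params_)
print(f"CV MAE: {best_cv_mae:.2f}; held-out MAE: {test_mae:.2f}")
```

Keeping the held-out rows out of the search means the last number is an estimate on data the tuning never touched.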

Final test

Once we are satisfied with our CV results, refit the model with the optimal set of hyperparameters on the training data and evaluate it on the test data.

$34304 average error; 0.39% error

Well would you look at that, we reduced the average error from 9% to less than 1%!

Explaining the model with SHAP

SHAP is one of my favorite tools because it allows me to examine how the model arrived at its prediction.
For certain models (e.g., linear models), we can easily understand what’s happening, but as we adopt more complex models it becomes increasingly difficult to interpret the model’s output.
In the words of the authors, Scott Lundberg and Su-In Lee (here):

“The ability to correctly interpret a prediction model’s output is extremely important. It engenders appropriate user trust, provides insight into how a model may be improved, and supports understanding of the process being modeled.”

Explain a single player’s predicted salary

The red and blue values show the features that push the model output higher (red) or lower (blue), and the base value is the average model prediction.
For this specific player, their predicted salary is $3.7 million, which is significantly lower than the base value of $8.
Their age, points/game, and usage percentage force their predicted salary lower while their defensive rebound percentage pushes their value higher.
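
To see how a base value and per-feature pushes add up to one prediction, here is a sketch that does not use the shap library or the gradient boosting model. It swaps in a linear model, where (assuming independent features) the SHAP value of a feature has the closed form coefficient times (feature value minus feature mean). The feature names and data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = ["age", "points_per_game", "usage_pct"]  # hypothetical features

# Synthetic data: target in arbitrary "millions" (assumption, not real salaries).
X = rng.normal(size=(200, 3))
y = 8.0 + X @ np.array([1.5, 2.0, -0.5]) + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Base value: the average model prediction over the data.
base_value = model.predict(X).mean()

# Closed-form SHAP values for a linear model with independent features.
player = X[0]
contributions = model.coef_ * (player - X.mean(axis=0))
prediction = model.predict(player.reshape(1, -1))[0]

for name, value in zip(feature_names, contributions):
    direction = "higher" if value > 0 else "lower"
    print(f"{name}: pushes the prediction {direction} by {abs(value):.2f}")

# Additivity: base value plus all contributions recovers the prediction.
print(f"base {base_value:.2f} + pushes {contributions.sum():.2f} = {prediction:.2f}")
```

The same additivity holds for the tree-based explanations in the plot: the arrows start at the base value and end at the player's prediction.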

Explain a bunch of players’ predicted salaries

We can also explain the entire test set, only this time, each single-player plot is rotated 90 degrees, and all the players are stacked horizontally.
On the lower end of the spectrum, there are more forces acting to push a player’s predicted salary lower while on the higher end, the opposite is true.
In the notebook, this is an interactive plot so you can order by features such as points/game, true shooting percentage and assists percentage to see how they affect the output value.

Summarize all feature importances

Lastly, we can summarize all feature importances by plotting the distribution of SHAP values for each feature, where the colors represent the magnitude of the feature value.
Here, for example, a lower age decreases the player’s predicted salary while a higher points/game increases it.
This provides a nice overview of how each feature impacts the model’s prediction.

TLDR

- Pick a scoring metric that makes the most sense for you
- Prefer a randomized search over an exhaustive search
- Prefer a single-fold CV over a k-fold CV strategy if you have a large enough dataset
- Understand your model predictions with SHAP

To follow along, you can find the full code set and notebook here.
Thanks for reading! Stay tuned for more as I continue on my path to become a data scientist! ✌️