NBA Salary Predictions
Josh Rosson
May 1

This project aims to explore how a wide variety of NBA statistics can be used to predict the salaries of NBA players from 1995 to 2017.
My goals for this project are to:
- Discover which statistics are the best predictors of an NBA player’s salary
- Use a machine learning model to predict NBA salaries
- Determine which players have been overvalued and undervalued according to their given vs. predicted salary
- Determine which teams are the best and worst at extracting value from their players, and whether that correlates with the number of games a team wins

Data

My main data set is a database of ~50 statistics and salary information for every NBA player dating back to 1950.
The link to the data can be found here.
I made the decision to only use the data starting in the 1995 season because the NBA was very different in 1950 than it is in the 21st century.
I chose 1995 because that was the year two new teams were added, bringing the team total to 29 (the 30th team, the Charlotte Bobcats, was added in 2004).
Two other data sources I used, salary cap information and team wins per season, were both taken from Basketball Reference and copied into an Excel spreadsheet, which I then imported into a DataFrame.
The data did have to be cleaned a bit: total statistics like minutes played or total points scored were replaced by their per-game equivalents to normalize the statistics, provided that the player played more than a certain number of games that season (if not, he was removed from the data set).
Additionally, there were a number of players who did not have salary information listed for that season and were removed from the data set.
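The cleaning steps described above could look roughly like the sketch below. The column names (`G`, `MP`, `PTS`, `salary`) and the 20-game cutoff are assumptions made for illustration; the actual threshold used in the project is not stated.

```python
import pandas as pd

MIN_GAMES = 20  # assumed cutoff; the real threshold is not given in the article


def to_per_game(df, total_cols, min_games=MIN_GAMES):
    """Drop low-game and salary-less rows, then turn season totals into per-game values."""
    kept = df[(df["G"] > min_games) & df["salary"].notna()].copy()
    for col in total_cols:
        kept[col + "_per_game"] = kept[col] / kept["G"]
    return kept.drop(columns=total_cols)


# Made-up rows: B played too few games, C has no salary listed.
players = pd.DataFrame({
    "player": ["A", "B", "C"],
    "G": [80, 10, 60],
    "MP": [2400, 150, 1500],
    "PTS": [1600, 40, 600],
    "salary": [5_000_000, 1_000_000, None],
})
per_game = to_per_game(players, ["MP", "PTS"])
print(per_game)
```

Only player A survives both filters in this toy example, with the totals replaced by per-game columns.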
[Figure: The first five data points]

Which Statistics?

After looking at the data, an early problem came up.
The average NBA salary has drastically increased over the past 20 years.
The solution was to normalize the salary data as well, expressing each salary as a percentage of the league’s salary cap, the total amount that a team can spend on its players in a given season.
The salary cap has risen as player salaries have.
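As a minimal sketch of that normalization, assuming a player table with `season` and `salary` columns and a separate cap table with `season` and `salary_cap` (all hypothetical names), the percentage can be computed after a merge. The cap figures below are round numbers chosen for illustration, not real cap values.

```python
import pandas as pd


def add_pct_of_cap(players, caps):
    """Attach each season's cap and express salary as a percentage of it."""
    merged = players.merge(caps, on="season", how="left")
    merged["pct_cap"] = 100 * merged["salary"] / merged["salary_cap"]
    return merged


# Illustrative numbers only: two very different salaries, same share of the cap.
players = pd.DataFrame({
    "player": ["A", "B"],
    "season": [1996, 2016],
    "salary": [2_300_000, 7_000_000],
})
caps = pd.DataFrame({
    "season": [1996, 2016],
    "salary_cap": [23_000_000, 70_000_000],
})
with_pct = add_pct_of_cap(players, caps)
print(with_pct[["player", "season", "pct_cap"]])
```

Both players end up at 10% of the cap, which is the point of the normalization: salaries from different eras become comparable.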
From there, it was time to explore the relationship between salary and statistics.
I started with a correlation heat map for all of the statistics in the dataset.
Unfortunately, because of the number of statistics, not much useful information was gained from looking at this.
So, I switched to a correlation heat map for the 8 statistics that had the highest Pearson r² values with salary.
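One way to pick the most correlated statistics is sketched below, assuming the normalized salary lives in a column called `pct_cap` (a hypothetical name). It squares the Pearson correlation of every numeric column with the target and keeps the top `k`.

```python
import pandas as pd


def top_correlated(df, target, k=8):
    """Return the k numeric columns with the highest Pearson r^2 against `target`."""
    numeric = df.select_dtypes("number")
    r = numeric.corr()[target].drop(target)
    return (r ** 2).sort_values(ascending=False).head(k)


# Tiny made-up table: pts and tov track salary perfectly, age barely at all.
stats = pd.DataFrame({
    "pct_cap": [1, 2, 3, 4, 5],
    "pts": [2, 4, 6, 8, 10],
    "age": [30, 22, 27, 25, 29],
    "tov": [5, 4, 3, 2, 1],
})
top = top_correlated(stats, "pct_cap", k=2)
print(top)
```

Squaring the correlation means strongly negative relationships rank as highly as strongly positive ones. The selected column names can then be used to subset the correlation matrix for a smaller heat map.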
I then split up the statistics into three subcategories: basic, regular, and advanced.
The basic statistics only contained the players’ positions, ages, and minutes/games played.
The regular statistics had ones that the casual fan would understand: points per game, rebounds per game, field goal %, etc.
Advanced statistics were created to evaluate player performance in more detail, some of which include Player Efficiency Rating (PER) and Win Shares (WS).
After the stats were split up, I used the same method to find the 8 statistics in each category that were most correlated with salary, and created heat maps and scatter plots for them.

[Figure: Heat map (left) and scatter plots (right) for advanced statistics]

There was some multicollinearity among the statistics in each category, so after an analysis, a few statistics were removed from the top 8.
Machine Learning Models

The next step was to determine which subcategory of statistics was the best predictor of salary.
To accomplish this, linear regression machine learning models were created for each of the three subcategories, the data was split into train and test data, and cross validation was performed.
The result was that the regular subcategory of statistics produced the best model, with a root mean squared error (RMSE) of 6.46 (the lowest value of the three) and an r² of .466 (the highest value of the three).
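The workflow for one subcategory might be sketched with scikit-learn as follows. This is not the project's actual code: the feature names, the 75/25 split, the 5-fold cross validation, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split


def evaluate_linear_model(X, y, seed=0):
    """Fit a linear regression and report test RMSE, test r^2, and mean CV r^2."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LinearRegression()
    # 5-fold cross validation on the training portion only
    cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    return model, rmse, r2_score(y_test, pred), cv_r2.mean()


# Synthetic stand-in for the "regular" statistics (not real NBA data).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "pts_per_game": rng.uniform(0, 30, 200),
    "fg_per_game": rng.uniform(0, 12, 200),
})
y = 0.5 * X["pts_per_game"] + 0.8 * X["fg_per_game"] + rng.normal(0, 1, 200)

model, rmse, r2, cv_r2 = evaluate_linear_model(X, y)
print(f"RMSE={rmse:.2f}  r2={r2:.3f}  CV r2={cv_r2:.3f}")
```

Running the same function on each subcategory's feature set gives three RMSE and r² values that can be compared directly.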
Below is the plot of the residual values: the difference between each player’s actual % of the salary cap and the predicted %.
Positive residual values indicate that the actual value is greater than the predicted one.
Valuation of Players

The regular linear model was used to predict the salary of every player entry in the data set.
These predictions were then compared to the actual salaries that players earned to create a residual column in the DataFrame, which was then sorted by the residual value.
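The residual column can be sketched as below, using the convention stated earlier (actual minus predicted, so a positive residual means the player was paid more than the model predicts). The column names and numbers are hypothetical.

```python
import pandas as pd


def add_residuals(df, predicted):
    """Add predicted values and residual = actual - predicted, sorted ascending."""
    out = df.copy()
    out["predicted_pct_cap"] = predicted
    out["residual"] = out["pct_cap"] - out["predicted_pct_cap"]
    # Most negative first: the most undervalued players lead the table.
    return out.sort_values("residual")


# Made-up actual and predicted cap percentages.
sample = pd.DataFrame({"player": ["A", "B", "C"], "pct_cap": [30.0, 5.0, 12.0]})
ranked = add_residuals(sample, [18.0, 11.0, 12.0])
print(ranked)
```

With an ascending sort, the head of the table holds the most undervalued players and the tail holds the most overvalued ones.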
A portion of the top 25 most undervalued and overvalued players is highlighted below.

[Figure: The most undervalued players according to the model]
[Figure: Some of the most overvalued players, according to the model]

The model states that Michael Jordan in 1997 was the most overvalued player in the last 22 years.
Looking at this data had me questioning the model.
The most overvalued players list was filled with in-their-prime superstars who were paid high salaries because of their abilities, including the all-time great Michael Jordan.
Summary statistics on the overvalued and undervalued players help explain what is going on.
[Figure: Averages for most undervalued players]
[Figure: Averages for most overvalued players]

The most undervalued players are mostly young, up-and-coming stars who are still on their rookie contracts, and therefore don’t take up much salary cap room.
The most overvalued players were mostly in-their-prime superstars who were paid accordingly.
After this realization, I decided to normalize the residuals by dividing each one by the player’s percentage of the cap, creating a residual %.
This provided a better representation of who was over and undervalued.
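A small sketch of the residual % follows, assuming the divisor is the player's actual percentage of the cap (the text does not say whether actual or predicted was used). Column names and values are invented.

```python
import pandas as pd


def add_residual_pct(df):
    """Scale each residual by the player's actual share of the cap, in percent."""
    out = df.copy()
    out["residual_pct"] = 100 * out["residual"] / out["pct_cap"]
    return out


# Made-up values: the star has the largest raw residual but not the largest relative one.
sample = pd.DataFrame({
    "player": ["star", "role player", "rookie"],
    "pct_cap": [30.0, 8.0, 4.0],
    "residual": [12.0, 6.0, -6.0],
})
scaled = add_residual_pct(sample).sort_values("residual_pct", ascending=False)
print(scaled)
```

In this toy table the role player tops the overvalued list at 75%, ahead of the star at 40%, even though the star's raw residual is twice as large. That is the effect the rescaling is meant to produce.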
[Figure: The new and improved most undervalued players]

My final task was to relate the valuation of players to the teams they played for, and how that correlated with team performance.
I grouped the players by team, calculated the average residual % per team (based on the average actual % of cap spent per player vs. the average predicted %), and added each team’s average number of wins over the past 22 years as an additional category.
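One reading of that team-level calculation is sketched here: average the actual and predicted cap percentages per team, derive a team residual % from those two averages, and join the average wins. The table layouts and column names are assumptions, and the numbers are made up.

```python
import pandas as pd


def team_summary(players, wins):
    """Per-team average actual vs. predicted cap share, residual %, and average wins."""
    by_team = players.groupby("team")[["pct_cap", "predicted_pct_cap"]].mean()
    by_team["residual_pct"] = (
        100 * (by_team["pct_cap"] - by_team["predicted_pct_cap"]) / by_team["pct_cap"]
    )
    avg_wins = wins.groupby("team")["wins"].mean().rename("avg_wins")
    return by_team.join(avg_wins).sort_values("residual_pct")


# Hypothetical player rows and season win totals for two teams.
players = pd.DataFrame({
    "team": ["X", "X", "Y", "Y"],
    "pct_cap": [20.0, 10.0, 6.0, 4.0],
    "predicted_pct_cap": [14.0, 10.0, 7.0, 5.0],
})
wins = pd.DataFrame({
    "team": ["X", "X", "Y", "Y"],
    "season": [2016, 2017, 2016, 2017],
    "wins": [55, 51, 30, 34],
})
summary = team_summary(players, wins)
print(summary)
```

An alternative would be to average each player's own residual % within a team; the two approaches weight high-salary players differently, so the choice affects the team ranking.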
[Figure: The top section of the data grouped by team]

The model was extremely down on the most overvalued players, predicting that they should take up a negative percentage of their team’s salary cap; in other words, the players should be paying the team for the privilege of playing.
Even so, it was helpful in determining which players were most overvalued.
From there, I created a scatter plot of residual % vs. average number of wins per team.
Quite surprisingly to me, the teams that overvalued their players generally had more success than the teams that undervalued them.
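The strength of such a relationship can be quantified with a Pearson correlation alongside the scatter plot. The sketch below uses four invented teams purely to show the calculation; the numbers are not results from this project.

```python
import pandas as pd

# Invented team-level values, for illustration only.
teams = pd.DataFrame({
    "team": ["W", "X", "Y", "Z"],
    "residual_pct": [-20.0, -5.0, 10.0, 20.0],
    "avg_wins": [32.0, 38.0, 45.0, 53.0],
})

# Pearson r between how much a team "overpays" and how often it wins.
r = teams["residual_pct"].corr(teams["avg_wins"])
print(f"Pearson r = {r:.3f}")
```

A positive `r` would match the pattern described above, where teams with higher residual % tend to win more.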
Problems and Potential Improvements

One problem that I was unable to solve came from the data set.
There were cases of a player switching teams mid-season via a trade or by other means, so that player had several lines in the data set (one line per team and one line that displayed total yearly statistics), all with the same salary.
I deleted the “total” line for each such player and hoped that the other lines would contain similar statistics so as not to skew the data.
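Dropping the combined rows might look like the sketch below. It assumes the total line is marked by a special value in the team column; the marker `"TOT"` and the column names are assumptions for illustration, not details confirmed by the article.

```python
import pandas as pd

TOTAL_MARKER = "TOT"  # assumed label for a traded player's combined-season row


def drop_total_rows(df, team_col="team", marker=TOTAL_MARKER):
    """Remove the combined-season rows, keeping one row per team stint."""
    return df[df[team_col] != marker].reset_index(drop=True)


# Player A was traded mid-season, so he has a total row plus one row per team.
rows = pd.DataFrame({
    "player": ["A", "A", "A", "B"],
    "season": [2015, 2015, 2015, 2015],
    "team": ["TOT", "X", "Y", "Z"],
    "G": [70, 40, 30, 82],
})
cleaned = drop_total_rows(rows)
print(cleaned)
```

The opposite choice, keeping only the total row for traded players, would leave one line per player-season but lose the team assignment needed for the team-level analysis.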
Some high-profile players were missing salary values and had to be deleted from the data set, including Shaquille O’Neal.
Players on the ends of the salary spectrum provide valuable information to the machine learning model.
Furthermore, some salary data was incorrect.
The data set has Cameron Bairstow making $19 million in 2015, when in reality he earned less than $1 million.
Another problem in my project was the time period used.
The NBA has undergone drastic change over the past 22 years, and my analysis did not reflect that.
In recent years, teams have based their salary offers to players much more on advanced statistics than in years past.
Additionally, player contracts have undergone a change.
In 1995, there was no maximum contract, and a player could be signed by his own team to any amount, even if it was over the salary cap.
In 1997, Michael Jordan earned $33 million, more than the salary cap allowed the Chicago Bulls to spend on their entire roster.
The maximum contract was introduced by the league in the following years to prevent such contracts.
This mix of old and new contracts led my model to believe that superstars in the late 1990s and early 2000s were on unreasonably large contracts, because the most recent batch of superstars does not earn that kind of money.
The project could have been better with a more recent time period focus.
I made the assumption that all teams would value players equally, and therefore did not use team as a statistic in the linear regression model.
For a future project, it would be interesting to see the effect of individual teams on player salaries.
Conclusions

The two main conclusions from my analysis are as follows:
- As obvious as it seems, players who score more points and make more field goals are paid higher salaries. That is why the regular subcategory produced a better model than the advanced subcategory, which I had expected to perform best.
- Teams that placed a higher value on their players than the market would dictate actually had more team success. This is most likely attributable to the presence of star players on these teams, whom the model tended to view as overvalued.