If you just quickly browse through the salaries of NBA players, you can see that it’s not necessarily the players who perform the best who get paid the most.

Carmelo Anthony was a prime example.

If that’s the case, then what does determine who gets paid the most? What better way to answer that question than with data science! If you want a more in-depth view of this project, or if you want to add to the code, check out the GitHub repository.

Data Preparation

First things first.

I need to import the necessary data, which I have stored in my GitHub repository.

I’m also importing the necessary Python libraries.
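The original import-and-load gist isn’t shown here, but it might look something like the sketch below. The repository URL and column names are placeholders; a tiny inline sample stands in for the real CSV so the snippet is self-contained.

```python
# Core libraries for the analysis: pandas/numpy for data wrangling
# (scikit-learn comes in later for modeling).
import io
import pandas as pd
import numpy as np

# In the real project the CSVs live in a GitHub repo and would be read with
# something like pd.read_csv("https://raw.githubusercontent.com/<user>/<repo>/main/Seasons_Stats.csv").
# The inline sample below is a stand-in for that file.
sample_csv = """Year,Player,Pos,G,MP,PTS
1991.0,Michael Jordan,SG,82,3034,2580
1991.0,Karl Malone,PF,82,3302,2382
,Unknown Player,C,10,50,20
"""
seasons_stats = pd.read_csv(io.StringIO(sample_csv))
print(seasons_stats.shape)
```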

Data Cleaning

Looking through the data, I can see that there are a lot of annoying things to deal with, particularly NA values.

The most important dataset at the moment is my Seasons_Stats set, which contains a boatload of metrics and player statistics that will become the features used to predict the target variable later.

Time to do some house cleaning.

Finding all of the null values

Removing NA values from the Year column and converting to integer values

As I was working through the data cleaning, I ran into an issue.

The salary data that I had collected but hadn’t cleaned up yet only went from 1980 to the present, neglecting any prior information.

Because of this, I had to cut off all of the stats from players before 1980.

Bummer.
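The cleaning steps described above might look roughly like this. The inline table stands in for the real Seasons_Stats file, and the column names are assumed:

```python
import io
import pandas as pd

# Stand-in for the Seasons_Stats CSV; one row has a missing Year.
seasons_stats = pd.read_csv(io.StringIO("""Year,Player,PTS
1978.0,George Gervin,2232
1991.0,Michael Jordan,2580
,Unknown Player,20
"""))

# 1. Find all of the null values per column.
print(seasons_stats.isnull().sum())

# 2. Drop rows with a missing Year and convert the column to integers.
seasons_stats = seasons_stats.dropna(subset=["Year"])
seasons_stats["Year"] = seasons_stats["Year"].astype(int)

# 3. The salary data only goes back to 1980, so cut off earlier seasons.
seasons_stats = seasons_stats[seasons_stats["Year"] >= 1980]
print(seasons_stats["Year"].tolist())
```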

I then took the time to actually learn what the various column names were.

Here are the meanings of the statistics:

‘Pos’ — position
‘Tm’ — team
‘G’ — games played
‘MP’ — minutes played
‘PER’ — player efficiency rating
‘TS%’ — true shooting percentage (weights 3-pointers higher)
‘3PAr’ — 3-point attempt rate
‘FTr’ — free throw attempt rate
‘ORB%’ — offensive rebound percentage
‘DRB%’ — defensive rebound percentage
‘TRB%’ — total rebound percentage
‘AST%’ — assist percentage
‘STL%’ — steal percentage
‘BLK%’ — block percentage
‘TOV%’ — turnover percentage
‘USG%’ — usage rate
‘OWS’ — offensive win shares
‘DWS’ — defensive win shares
‘WS’ — win shares
‘WS/48’ — win shares per 48 minutes
‘OBPM’ — offensive box plus/minus
‘DBPM’ — defensive box plus/minus
‘BPM’ — box plus/minus
‘VORP’ — value over replacement player
‘FG’ — field goals made
‘FGA’ — field goals attempted
‘FG%’ — field goal percentage
‘3P’ — 3-pointers made
‘3PA’ — 3-pointers attempted
‘3P%’ — 3-point percentage
‘2P’ — 2-pointers made
‘2PA’ — 2-pointers attempted
‘2P%’ — 2-point percentage
‘eFG%’ — effective field goal percentage
‘FT’ — free throws made
‘FTA’ — free throws attempted
‘FT%’ — free throw percentage
‘ORB’ — offensive rebounds
‘DRB’ — defensive rebounds
‘TRB’ — total rebounds
‘AST’ — assists
‘STL’ — steals
‘BLK’ — blocks
‘TOV’ — turnovers
‘PF’ — personal fouls
‘PTS’ — points

For the salary data, I realized that inflation and other time-related issues would cause problems.

Because of this, I created other metrics that would allow salaries to be relative to one another instead of an absolute number.

These metrics are: team payroll, player salary as proportion of team payroll, team payroll as proportion of total NBA salary payroll, and player salary as proportion of total NBA salary payroll.

It took some time to create all of the values.

I used Excel to automate some things, but I had to create a Python script to save even more time.

Here’s a brief glimpse at what I created to do that.
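The script itself isn’t reproduced here, but a sketch of the payroll calculations might look like this. The table and column names are hypothetical stand-ins for the real salary data:

```python
import pandas as pd

# Hypothetical salary table; the real file's column names may differ.
salaries = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Tm": ["NYK", "NYK", "LAL", "LAL"],
    "Salary": [10_000_000, 5_000_000, 20_000_000, 5_000_000],
})

# Team payroll: total salary per team, broadcast back onto each player row.
salaries["Team_Payroll"] = salaries.groupby("Tm")["Salary"].transform("sum")

# League-wide payroll for the season.
league_payroll = salaries["Salary"].sum()

# The four relative metrics from the text.
salaries["Salary_Prop_Team"] = salaries["Salary"] / salaries["Team_Payroll"]
salaries["Team_Prop_League"] = salaries["Team_Payroll"] / league_payroll
salaries["Salary_Prop_League"] = salaries["Salary"] / league_payroll
```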

Feature Engineering

As I mentioned before, there were a bunch of new variables I had to create.

I named them: player leverage, league weight, team market size, and region of US.
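A rough sketch of building some of these columns is below. League Weight is defined later in the article as a player’s share of the total league payroll; the market-size and region lookup tables are illustrative guesses, and player leverage is omitted since its definition isn’t given here:

```python
import pandas as pd

# Toy stand-in for the merged salary data.
df = pd.DataFrame({
    "Player": ["A", "B"],
    "Tm": ["NYK", "MIL"],
    "Salary": [10_000_000, 5_000_000],
})

# League Weight: each player's salary as a proportion of total league payroll.
df["League_Weight"] = df["Salary"] / df["Salary"].sum()

# Team market size and US region via simple lookup tables keyed on team
# (hypothetical mappings for illustration).
market_size = {"NYK": "large", "MIL": "small"}
region = {"NYK": "Northeast", "MIL": "Midwest"}
df["Market_Size"] = df["Tm"].map(market_size)
df["US_Region"] = df["Tm"].map(region)
```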

Player Leverage column

League Weight column

Team Market Size column

US Region column

Merging Datasets

I ended up only using two of the three datasets I started with because they encompassed all of the necessary information.

Now, I need to merge these two datasets together.
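The merge step might be sketched like this, assuming player name and season are the join keys (the actual keys and columns may differ):

```python
import pandas as pd

# Toy stand-ins for the two datasets.
stats = pd.DataFrame({
    "Player": ["A", "B"], "Year": [1995, 1995], "PTS": [2000, 1500],
})
salaries = pd.DataFrame({
    "Player": ["A", "B"], "Year": [1995, 1995], "League_Weight": [0.02, 0.01],
})

# Left merge on player and season keeps every stats row and attaches the
# matching salary columns.
merged = stats.merge(salaries, how="left", on=["Player", "Year"])

# Restrict to seasons from 1990 onward.
merged = merged[merged["Year"] >= 1990]
```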

Doing a left merge to combine every column

Restricting only to after 1990

Dealing With Categorical Variables

Because the target variable in this dataset is continuous, I need to use a machine learning algorithm that takes in continuous values.

The problem is that some of the features in the set are categorical, so I need to create dummy variables for these.
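With pandas, one-hot encoding the categorical columns is a one-liner; the column names here are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "Pos": ["C", "PG", "C"],
    "US_Region": ["Northeast", "Midwest", "West"],
    "PTS": [1200, 1500, 900],
})

# One-hot encode the categorical columns so a regression model can use them;
# each category becomes its own 0/1 indicator column.
df = pd.get_dummies(df, columns=["Pos", "US_Region"])
print(sorted(df.columns))
```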

Exploratory Data Analysis

Now it’s time to explore some of the relationships in the data! Admittedly, I could have spent more time in this phase, so feel free to explore the data more for yourself.
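One quick way to explore those relationships is a correlation check against the salary metric; the columns below are synthetic stand-ins, and in a notebook this would typically be visualized with something like seaborn’s heatmap:

```python
import pandas as pd

# Synthetic stand-in for a few numeric columns of the merged dataset.
df = pd.DataFrame({
    "PTS": [1000, 1500, 2000, 2500],
    "MP": [1800, 2400, 3000, 3200],
    "League_Weight": [0.005, 0.010, 0.015, 0.022],
})

# Pairwise correlations between the features and the salary metric.
corr = df.corr()
print(corr["League_Weight"].sort_values(ascending=False))
```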

Predictive Model Building

Time to build a model that will yield the most accurate results! First, we need to split the data into training and testing data.
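The split step is standard scikit-learn; the random data below stands in for the merged feature matrix, with League Weight as the continuous target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and target standing in for the real merged dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)

# Hold out 30% of the rows for testing (split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```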

I used three machine learning algorithms to see which one fared the best: linear regression, decision trees, and random forests.

Here’s the training for linear regression:

I also quickly used the output coefficients to see which variable had the biggest impact on the linear regression model.
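A sketch of that fit-and-inspect step is below. The feature names and the target construction are hypothetical (the target is built so the center indicator dominates, mirroring the finding in the text):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
feature_names = ["PTS", "MP", "Pos_C"]
X = pd.DataFrame(rng.normal(size=(80, 3)), columns=feature_names)
# Synthetic target where the Pos_C dummy carries the largest weight.
y = 0.1 * X["PTS"] + 0.05 * X["MP"] + 0.8 * X["Pos_C"]

lm = LinearRegression().fit(X, y)

# Rank features by absolute coefficient to see which drives predictions most.
coefs = pd.Series(lm.coef_, index=feature_names).abs().sort_values(ascending=False)
print(coefs)
```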

Based on this, we can see that the feature of a player being a center had the highest effect on how much a player is paid relative to the rest of the league.

Isn’t that intriguing?

Here is the training for the decision tree model:

Here is the training for the random forest model:

Using the metrics MAE (mean absolute error), MSE (mean squared error), and RMSE (root mean squared error), we can compare how each of the models did in predicting league weight for each player.
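The tree and forest training plus the three-metric comparison might be sketched like this, again on synthetic stand-in data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the merged NBA dataset.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -0.3, 0.2, 0.1]) + rng.normal(scale=0.05, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

results = {}
for name, model in [
    ("tree", DecisionTreeRegressor(random_state=42)),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=42)),
]:
    preds = model.fit(X_train, y_train).predict(X_test)
    mse = mean_squared_error(y_test, preds)
    # MAE, MSE, and RMSE for each model, as in the article's comparison.
    results[name] = {
        "MAE": mean_absolute_error(y_test, preds),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
    }
print(results)
```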

Based on these numbers, it appears that linear regression actually did the best! It’s important to note, however, that I didn’t do much hyperparameter tuning for the decision tree and random forest models, so perhaps that played an important role in why they underperformed.

Potential Improvements

Thank you for taking the time to read this! This was actually my first major end-to-end data science project, so there is much room for improvement.

There are several things that could be improved for this project.

1. Dealing with the time value of money more effectively

Because the value of money changes over time, I knew that I couldn’t compare salaries from the 1990s to the salaries of players today.

Instead of converting the numbers to a single inflation-adjusted figure, I opted to create a metric called League Weight that makes all of the player salaries relative to one another in terms of who has a greater proportion of the total payroll.

Maybe this wasn’t the best way to deal with the problem.

Feel free to find a better way to adjust the salary numbers.

2. Getting more over-arching data

I only used data from 1990 to the present because salary data and advanced statistics were sparse and difficult to find for earlier years.

Before 1979 there wasn’t even a 3-point line! Nonetheless, for people who are more patient than I am, don’t be afraid to look for more historic data that might allow for a more comprehensive analysis of NBA salary history.

3. Adding more features

I used a lot of variables in this project, but their significance to the final outcome definitely varied a lot.

There are many other advanced stats in the NBA that I didn’t use that others could add.

4. Using more machine learning algorithms

I only used three machine learning algorithms, and all of them were pretty simple.

I considered using neural networks, but that would have required a lot of hyperparameter tuning.

5. Fine-tuning the algorithms already used

Even among the few algorithms I used in this project, I could definitely have tuned the decision tree and random forest models to make them more effective.
