How Machine Learning Made Me Fall in Love with the WNBA
Using k-means clustering to fulfill my fantasy of building a women’s sports dream team.
Jasmine Vasandani, Apr 30
I had one of those daydreams that come to you from out of nowhere.
Before my eyes fell the image of an all-star women’s sports team.
The greatest female players across all teams and sports genres playing on the same team.
Like Mario Kart except instead of Mario, Princess Peach, and Toad you have Serena Williams, Lisa Leslie, and Katelyn Ohashi all playing on the same platform.
I realized that I could make this vision a reality by using data science and machine learning tools to design the best teams and predict what it would look like if they were to play against each other.
However, as with all big dreams (and big data), I decided to start with a subset of the women's sports world and work my way up towards acquiring data from other women's sports.
So I’m kicking off my mission to make my daydream a reality by analyzing data from the WNBA.
Get ready everyone, the ultimate WNBA game is about to begin.
LA Sparks getting hyped before playing.
Step 1: Getting Every Single WNBA Player On the Court (or, data acquisition)
The first thing I did was scrape every single WNBA player’s per-game career stats from Basketball-Reference.
In total, I acquired stats for 923 WNBA players.
Below is a preview of what the head of my DataFrame looked like.
If you want to know what each of the columns in the DataFrame below stands for, check out this glossary.
I’ll go ahead and translate one row so that you get an idea of how to interpret this data.
Farhiya Abdi of the Los Angeles Sparks has played a total of 52 games and started 5 of those games.
On average per game, she played on the court for 9.6 minutes and made 38% of attempted shots, 25% of attempted three-pointers, 43% of attempted two-pointers, and 68% of attempted free throws.
Abdi also averaged 1 rebound, 0.4 assists, 0.2 blocks, 0.4 turnovers, and 1.2 personal fouls per game.
And finally, Abdi scored an average of 2.9 points per game.
Once I’m done clustering all players into ranked teams, I’ll use these per-game stats to simulate a data-driven playoff between the top two teams.
Step 2: Divide all WNBA Players by their Positions
In order to create the best teams, I decided to split all players according to their positions and use machine learning to develop ranked groups within each position.
As documented on Basketball-Reference, WNBA players are classified according to these five positions: Forward, Center, Forward-Center, Guard, and Forward-Guard.
Of the 923 WNBA players, 908 had positions listed.
Here’s a breakdown of players by position.
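For readers who want to try the split themselves, here is a minimal sketch in pandas. It is not the code from my repo: the column name `Pos`, the position abbreviations, and the toy rows are all made up for illustration.

```python
import pandas as pd

# Toy stand-in for the scraped table; names, labels, and values are invented.
players = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "Pos": ["G", "F", "G", "C", None, "F-C"],
    "PTS": [10.2, 8.1, 4.4, 12.0, 3.3, 9.5],
})

# Keep only players that have a position listed.
with_pos = players.dropna(subset=["Pos"])

# Number of players in each position.
position_counts = with_pos["Pos"].value_counts()

# One DataFrame per position, ready to be clustered separately.
by_position = {pos: group for pos, group in with_pos.groupby("Pos")}
```

Keeping a dictionary of per-position DataFrames makes it easy to run the same clustering step on each position in a loop.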
Step 3: Identify Best Performance Rates for Each Position
After I divided all players into five categories according to their position, I calculated each position’s average performance based on the following categories: field goal percentage (FG%), total rebounds (TRB), steals (STL), blocks (BLK), assists (AST), and total points (PTS).
Some positions performed better on more categories than other positions.
For instance, forward-center players scored significantly higher than other positions in more performance categories, so they have four best qualities.
On the other hand, forward players were more specialized and performed well in two categories.
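One way to run this kind of comparison is sketched below. The rule it uses (a category counts as a strength when the position's mean beats the overall mean) is an assumption chosen for illustration, not necessarily the exact rule I applied, and the rows are invented.

```python
import pandas as pd

stats = ["FG%", "TRB", "STL", "BLK", "AST", "PTS"]

# Invented per-game rows for two positions.
players = pd.DataFrame(
    [
        ["G", 0.40, 2.0, 1.2, 0.1, 3.0, 9.0],
        ["G", 0.38, 1.8, 1.0, 0.1, 2.6, 8.0],
        ["C", 0.50, 6.0, 0.5, 1.2, 0.8, 7.0],
        ["C", 0.48, 5.0, 0.4, 1.0, 1.0, 6.0],
    ],
    columns=["Pos"] + stats,
)

# Average of every category within each position, and across all players.
position_means = players.groupby("Pos")[stats].mean()
overall_means = players[stats].mean()

# Hypothetical rule: a category is a strength if the position beats the overall mean.
strengths = {
    pos: [s for s in stats if row[s] > overall_means[s]]
    for pos, row in position_means.iterrows()
}
```

With these toy numbers, guards come out strong in steals, assists, and points, while centers come out strong in field goal percentage, rebounds, and blocks.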
Here’s what I determined were the best performance qualities for each position.
Forward-Center: TRB, BLK, FG%, PTS
Forward-Guard: AST, PTS, STL
Center: BLK, FG%, TRB
Guard: AST, PTS, STL
Forward: FG%, TRB
Step 4: Use Machine Learning to Create Ranked Clusters Within Each Position
With the WNBA players divided according to their positions, I created clusters within each position using their best-performing qualities as my predictor variables.
I used an unsupervised machine learning model called K-Means Clustering to help me with this task.
K-Means Clustering groups data around central points, thereby creating distinct clusters of similar data.
The number of central points, represented by the letter k, is a parameter that needs to be determined beforehand.
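To make that concrete, here is a small sketch using scikit-learn's `KMeans`. The six rows of (AST, PTS, STL) values are invented, and `k = 2` is chosen only because the toy data has two obvious groups.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented per-game stats (AST, PTS, STL) for six players.
X = np.array([
    [0.5, 2.0, 0.3],
    [0.6, 2.5, 0.4],
    [0.4, 1.8, 0.2],
    [3.0, 12.0, 1.5],
    [3.2, 13.0, 1.6],
    [2.9, 11.5, 1.4],
])

# k is passed in as n_clusters and must be chosen before fitting.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)        # cluster number assigned to each player
centroids = model.cluster_centers_   # one central point per cluster
```

Each row of `centroids` is the average stat line of the players assigned to that cluster.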
Finding the best k to create ranked clusters for each position took many attempts.
I also knew beforehand that I wanted to cluster each position by the same number k, so there were some cases where the clusters for one position created cleaner groupings than others.
Ultimately I went with n_clusters = 8 since that number tended to have a relatively higher silhouette and lower inertia score across the board.
The higher the silhouette score the more distinct the groups are, and the lower the inertia score the more similar each data point is within a cluster.
In the image below, you can see how the silhouette score is pretty high for the Forward-Center clusters (see left), but low for Guard clusters (see right).
This means that clusters of players among Forward-Centers will be more distinct than the clusters of players among Guards.
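A search over k along these lines can be sketched as follows. The data here is synthetic (three well-separated blobs generated with NumPy), so it only demonstrates the mechanics of comparing silhouette and inertia, not my actual results.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight blobs, far apart from each other.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit one model per candidate k and record both scores.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = {
        "silhouette": silhouette_score(X, model.labels_),
        "inertia": model.inertia_,
    }

# The k with the most distinct clusters according to the silhouette score.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
```

Inertia tends to keep falling as k grows, so it is usually read together with the silhouette score rather than minimized on its own.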
After the players were categorized into clusters, it was up to me to determine the rankings and make tweaks to any of the assignments.
Clusters don’t automatically translate into rankings, and sometimes clusters form imperfect groups.
I’ll walk you through how I interpreted the clusters for Forward-Guards.
In the cluster visualization below (see left), there are eight clusters corresponding to n_clusters = 8.
Remember, I determined which performance categories to use for each position.
For forward-guards I chose assists, total points, and steals since they performed well in these categories.
Now to create the rankings, I referred to the centroid calculations (right).
The index of the centroid DataFrame refers to a specific cluster number, ranging from 0 to 7.
The cluster with the highest average tended to perform better across all performance categories.
Players in cluster 5 would be ranked as the best, in cluster 4 the second best, and so on.
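As a rough sketch of that first-pass ranking, the snippet below averages each centroid's values and sorts the clusters from highest to lowest. The centroid numbers are invented and there are only three clusters to keep it short.

```python
import pandas as pd

# Invented centroids; the index is the cluster number.
centroids = pd.DataFrame(
    {"AST": [1.0, 2.5, 0.5], "PTS": [5.0, 9.0, 14.0], "STL": [0.6, 1.2, 0.4]},
    index=[0, 1, 2],
)

# Average across the performance categories for each cluster.
centroid_average = centroids.mean(axis=1)

# Cluster numbers ordered from best to worst by that average.
ranking = list(centroid_average.sort_values(ascending=False).index)
```

Because points are on a larger scale than assists or steals, a plain average like this leans heavily on PTS, which is one reason the ranking still needs a manual check.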
However, if you take a look at the visualization you’ll see that cluster 4 only has one player in it.
It was easy for me to figure out that the player who belongs to cluster 4 is Elena Delle Donne of the Washington Mystics.
She scored the highest total points of all players in her position (her cluster point is farther than everyone else’s in the PTS visualization), and she’s a five-time WNBA superstar.
According to this ranking, she got placed on the second-best team? Well, Delle Donne performed lower than average in assists and steals compared to other players in her position.
In fact, comparing her cluster (4) with cluster 1, which has more players in it, shows that the players in cluster 1 significantly outperform her in the steals and assists categories.
So because of this, I swapped positions between clusters 4 and 1.
The new ranking became: cluster 5 the best, cluster 1 the second best, cluster 4 the third best, and so on.
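A manual adjustment like this is easy to express in code. The helper below is hypothetical (`swap_ranks` is a name made up for this sketch), and the list only contains the top three forward-guard clusters, since those are the only ranks spelled out above.

```python
def swap_ranks(ranking, a, b):
    """Return a copy of ranking with cluster numbers a and b exchanged."""
    swapped = list(ranking)
    i, j = swapped.index(a), swapped.index(b)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

# Top three forward-guard clusters by centroid average, best first.
initial_ranking = [5, 4, 1]

# Move cluster 1 ahead of cluster 4 after inspecting assists and steals.
adjusted_ranking = swap_ranks(initial_ranking, 4, 1)
```

Returning a new list instead of editing in place keeps the original centroid-based ranking around for comparison.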
I scrutinized all five positions’ eight clusters and ranked them from best to worst within each position.
Now it’s time to form the teams.
Step 5: Create Eight Ranked Teams With All-Time WNBA Players
Since basketball teams are made up of five positions, I took the ranked clusters from each position and re-joined them into teams.
The size of each team varies, and so does the number of players in each position.
Since the distribution of performance for most categories is right-skewed, there are fewer players who perform significantly better than everyone else in each category (see image below).
Therefore, the best team out of the eight will have fewer players in it.
I named the teams “Team 1” through “Team 8,” with Team 1 being the highest performing and Team 8 being the lowest performing.
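The re-joining step can be sketched as a lookup from (position, cluster) to team number. Everything in the snippet is illustrative: the column names, the two positions, and the per-position rankings are invented, and only two teams are formed to keep it short.

```python
import pandas as pd

# Invented players with the cluster each one was assigned within their position.
players = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Pos": ["G", "G", "C", "C"],
    "cluster": [1, 0, 0, 1],
})

# Hypothetical rankings: cluster numbers ordered best to worst for each position.
rankings = {"G": [1, 0], "C": [0, 1]}

# Rank 1 in every position goes to Team 1, rank 2 to Team 2, and so on.
team_of = {
    (pos, cluster): rank
    for pos, order in rankings.items()
    for rank, cluster in enumerate(order, start=1)
}
players["Team"] = [
    team_of[(pos, cluster)]
    for pos, cluster in zip(players["Pos"], players["cluster"])
]

# Team sizes, which need not be equal.
team_sizes = players["Team"].value_counts().sort_index()
```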
For the sake of brevity, I’ll go over the top two best teams, and you can access my GitHub repository (linked at the end of this article) to see the rest of the details.
Team 1 has 35 players and Team 2 has 56 players.
Both teams have all-star players, but it just turned out that the players on Team 1 tended to perform better in more than one skill than other players in their position.
Therefore, Team 1 should be the team with players who are the most well-rounded in terms of skill abilities.
Since I did not use the number of games played to determine the rankings, each team has a good mix of rookie and veteran players.
For instance, veteran Cynthia Cooper ranked number 1 in Team 1 for the highest point average per game, and rookie A’ja Wilson ranked number 1 in Team 2 for the highest point average per game.
Players who had the highest point average per game in Team 1 and Team 2.
Below is the final roster for Team 1 — the dream team — comprised of 16 Guards, 12 Forwards, 3 Centers, 2 Forward-Guards, and 2 Forward-Centers.
To name just a few highlights: Cynthia Cooper has the highest point average per game, Katie Smith has played the most games, Tamika Catchings has started the most games, Chiney Ogwumike has the highest three-point percentage average per game, and Brittney Griner has the highest block average per game.
Step 6: Simulate a Data-Driven Game Between the Top Two Teams
Take a seat, everyone.
The ultimate (data-driven) WNBA game is about to begin!
To tip off the game, let’s go with the players from each team who have started the most games: Tamika Catchings with 448 games started from Team 1 and Sue Bird with 508 games started from Team 2.
It’s a close call with offensive rebounds.
Team 1 has a higher mean for offensive rebounds, but its distribution is more skewed than that of Team 2, whose mean and median for offensive rebounds are similar.
If Team 1’s outliers are off the court, Team 2 has a higher chance of making more offensive rebounds.
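The mean-versus-median reasoning can be checked with a few lines of standard-library Python. The rebound numbers below are invented to mimic the pattern described: one team is lifted by a single outlier, while the other is evenly spread.

```python
from statistics import mean, median

# Invented offensive-rebound averages per player.
team_1_orb = [0.5, 0.6, 0.7, 0.8, 3.4]   # one outlier pulls the mean up
team_2_orb = [0.9, 1.0, 1.0, 1.1, 1.0]   # mean and median nearly identical

def summarize(values):
    """Mean, median, and their gap; a large gap hints at a skewed distribution."""
    m, med = mean(values), median(values)
    return {"mean": m, "median": med, "gap": m - med}

team_1_summary = summarize(team_1_orb)
team_2_summary = summarize(team_2_orb)

# Team 1's mean once its outlier is off the court.
team_1_without_outlier = mean(team_1_orb[:-1])
```

In this toy example Team 1 has the higher mean, but once its outlier is removed its average drops below Team 2's.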
But there’s no denying it with defensive rebounds: Team 1 performs significantly better in this category than Team 2.
The three-pointer success rate is going to be high in this game, giving you more opportunities to cheer at the sight of that ball flying across the court and making it effortlessly into the hoop.
Both teams have a high percentage of making three-pointers.
And now for the moment you’ve all been waiting for: the winner of this game.
Well, Team 1 beats Team 2 by having a points-per-game average that’s four points more than Team 2.
But you never know — if the outliers of both teams who score an average of 20 points per game play against each other, it’s safe to say that the game could be a close call.
Conclusion
Watching sports through data alone was more fun than I thought it would be.
I’m pleased with my results. Looking ahead, I’ll probably go back to my data to refine my feature selection process and model parameters.
In addition to envisioning my daydream of having all women’s sports players play on the same platform, I believe I achieved something even greater through this process: giving the WNBA the spotlight it deserves.
Considering that sports data analysis is a thriving field, there needs to be more attention paid to women’s sports data and analysis of that data.
Lastly, I felt so inspired witnessing (through data) the feats of the greatest WNBA players of all time.
I was never really a serious sports fan, but this project made me fall in love with the WNBA.
To the WNBA community, you now have a new fan thanks to data science.
LA Sparks showing off their three-pointer skills.
To view the code and analysis referenced here, view my GitHub repo.
Thanks to Shawn Vasandani and Anuva Kalawar for helping me understand basketball better.
And thanks to Riley Dallas for helping me think through my modeling process.
Jasmine Vasandani is a data scientist and a proud fan of the WNBA.
You can learn more about her here: www.