Creating a winning bracket is hard and even trips up college basketball’s expert analysts.
Instead of leaving the guesswork to fate or watching thousands of hours of basketball each season (I guess I do that anyway, but that’s beside the point), why not train a computer to make predictions for you?
With the help of Python and a few awesome libraries, you can build your own machine learning algorithm that predicts the final scores of NCAA Men’s Division-I College Basketball games in less than 30 lines of code.
This tutorial explains all of the steps required to create a machine learning application, including setup, data retrieval and processing, training a model, and printing final predictions.
Setup
Pre-Requisites
To follow this tutorial, a basic understanding of Python is highly recommended though not required.
Familiarity with importing modules, getting and setting variables, using dictionaries, and instantiating classes is a good baseline, while experience with pandas and sklearn is a huge plus.
Development Requirements
Python 3: All of the code below will work with Python 2 as well, but Python 3 is recommended to ease the transition away from Python 2 when it goes end-of-life in early 2020.
pandas, sportsreference, sklearn: Our required dependencies which will be explained in further detail below.
They can be installed via pip with the following:
pip install pandas sklearn sportsreference
An active network connection: This likely won’t be an issue for most, but the development environment that you end up using must have access to the external web so that our code can download a dataset.
Building the Application
Now that our development environment is set up, let’s begin building the actual application.
Complete Algorithm
[Gist: Complete machine learning program to predict college basketball scores]
For those who like to jump straight to the code, the gist above is our finalized program that we will use.
If you are already familiar with pandas and sklearn, you can skip to the bottom of this tutorial to see how this program is run and how it can be extended for higher accuracy, faster runtime, and improved usability.
For everyone else that wants a further explanation of this code, continue reading below to learn the purpose of each step.
Importing Dependencies
[Gist: Importing all required dependencies]
Nearly every Python program begins with an import section where required dependencies are included to be used later on in the module.
For this project, we need to import the following packages that we installed earlier:
pandas: A popular data science library for Python which we will use to store and manipulate our dataset.
sportsreference: A free Python sports API that we will use to pull stats from NCAAB games.
More information can be found in this blog post.
sklearn: One of the biggest machine learning libraries for Python which includes several pre-made algorithms, such as the RandomForestRegressor we will be using, as well as useful tools to aid the data creation pipeline like train_test_split which creates training and testing datasets automatically.
Initializing the Dataset
[Gist: Initializing our dataset using sportsreference]
No machine learning application would be complete without a dataset.
To help us predict final scores for NCAAB games, we want to create a dataset containing all of the individual game statistics (such as shooting percentage, number of turnovers and blocked shots, rebound percentages, and much more) which we can then use to predict how those factors correlate to final scores.
To create this dataset, we first need to initialize an empty Pandas DataFrame that we will use to store our final data.
Next, we initialize the Teams class from sportsreference which contains information for every NCAA Men’s Division-I Basketball team for the current or most recent season and allows us to easily grab statistical data on a team-by-team basis.
Prior to pulling data, we need to iterate over every team by running for team in teams: where each iteration corresponds to a unique team in the league.
sportsreference exposes schedule and boxscore information for each team and enables us to write code like team.schedule.dataframe_extended, which collects statistical information on a per-game basis for every game the team has participated in during the current season.
The dataframe_extended property returns a pandas DataFrame where each index corresponds to a different game.
After collecting boxscore information for each game, we want to add it to our overall dataset so we have one singular source of data.
This can be done by concatenating our existing dataset with the local DataFrame containing the current team’s complete boxscore information.
By overwriting our existing dataset with the resulting concatenation, we ensure that the dataset always includes information not only for the most recent team but all teams that were previously queried as well.
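The accumulation step can be sketched without a network connection. The snippet below is only a minimal illustration: the small hand-made DataFrames are stand-ins for the per-team boxscore frames, and the team keys, column names, and numbers are all made up. It also gathers the frames in a list and concatenates once at the end, a variation on overwriting the dataset in every iteration that yields the same combined result.

```python
import pandas as pd

# Hypothetical stand-ins for the per-team frames. With sportsreference,
# each frame would instead come from a team's schedule.
team_frames = {
    'TEAM-A': pd.DataFrame({'home_points': [70, 65], 'away_points': [60, 72]},
                           index=['game-1', 'game-2']),
    'TEAM-B': pd.DataFrame({'home_points': [81], 'away_points': [79]},
                           index=['game-3']),
}

# Collect one frame per team, then stack them into a single dataset.
frames = []
for name, frame in team_frames.items():
    frames.append(frame)

dataset = pd.concat(frames)
print(dataset)
```

Each row of the combined frame is one game, and every team that was visited contributes its rows to the same dataset.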
Preprocessing the Dataset
[Gist: Preprocess our dataset by dropping unused values, creating our X and y, and separating a training and testing dataset]
After our dataset finishes building, we need to filter out a few categories (or features, as they are often called in machine learning) that we don’t want to use, namely those that are of string type (or categorical) like the team names or the date and location.
Sometimes, string-based features can be useful, such as when predicting home values, where properties listed as “waterfront” tend to have a higher value than those classified as “inland”.
Though this feature is useful for house price predictions, most machine learning algorithms can’t handle string-based data.
One method of replacing these types of features is called one-hot encoding, which replaces each distinct categorical value with its own feature column where every index that falls into that category has a value of 1, or a 0 if it does not.
By changing the categories to 1’s and 0’s, machine learning algorithms are able to handle these features more effectively.
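As a quick illustration of one-hot encoding, here is a minimal sketch using pandas’ get_dummies on the house price example. The column names and prices are made up for the demonstration.

```python
import pandas as pd

# Made-up housing data with one string-based (categorical) feature.
homes = pd.DataFrame({
    'price': [450000, 300000, 520000],
    'setting': ['waterfront', 'inland', 'waterfront'],
})

# Replace the 'setting' column with one 0/1 column per distinct value.
encoded = pd.get_dummies(homes, columns=['setting'], dtype=int)
print(encoded)
```

The single setting column becomes setting_inland and setting_waterfront, each holding a 1 for the rows that belong to that category and a 0 otherwise.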
For our purposes, however, we will simply drop these features since they are either too numerous (i.e., the number of possible venues the games can be played at is huge), meaningless (it shouldn’t really matter whether a game is played on November 18th or December 2nd in determining the result based on stats), or would introduce bias (we want the algorithm to determine the final score based on how a team is playing, not just because their name is “Duke”).
As a result, we will drop all of those categories.
At this point, some might be wondering why I included home_points and away_points in the list of fields to drop.
These two fields are the final output (often referred to as labels) that we want to predict, so we do not want them to be included in our main features and should instead reserve them exclusively for our output labels.
Stepping through the code above, we first drop all of the unwanted features from our dataset and save the trimmed output as X.
After dropping unused features, we next remove all rows with incomplete data.
This sometimes happens if the data is not properly populated on sports-reference.com or if a team didn’t perform a certain statistical action, such as never blocking a shot or never attempting a free throw.
We can handle this incomplete data either by replacing missing values with a fixed number (such as an average for the category or a default of zero) or by dropping any rows that are invalid.
Since the number of invalid cells is very small for our dataset, we will just drop any rows that have incomplete data, as doing so will not meaningfully impact our final results.
Since it takes two to tango (err, two participating teams for a game to be played), there will also be two copies of each game, as the schedules for both teams are pulled (once for the home team and once for the away team).
This just pollutes our dataset and doesn’t provide any value since the rows are exactly identical, so we want to remove any copies and keep just one instance of each game.
To do so, we simply call drop_duplicates() on our dataset to ensure every row is unique.
Next, we need to create our output labels that will be used to determine the accuracy of our model’s weights while training and to test the accuracy of our final algorithm.
We can generate our labels by creating a two-column vector containing just the home and away points and set the result as y.
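The cleanup and label steps can be sketched on a tiny hand-made frame. This is an illustration under assumptions: the column names and values are invented, and the sketch cleans the rows before separating features from labels so that X and y keep the same number of rows.

```python
import pandas as pd

# Made-up games: 'g1' appears twice (once per team's schedule) and
# 'g2' is missing a statistic.
games = pd.DataFrame({
    'winner': ['Home', 'Home', 'Away', 'Home'],
    'home_points': [70, 70, 65, 81],
    'away_points': [60, 60, 72, 79],
    'home_blocks': [3, 3, float('nan'), 5],
    'away_turnovers': [12, 12, 9, 14],
}, index=['g1', 'g1', 'g2', 'g3'])

# Drop the string column, then incomplete rows, then duplicate games.
cleaned = games.drop(columns=['winner']).dropna().drop_duplicates()

# Features are everything except the scores; labels are the two scores.
X = cleaned.drop(columns=['home_points', 'away_points'])
y = cleaned[['home_points', 'away_points']].values
print(X.shape, y.tolist())
```

Only two games survive here: the duplicate copy of g1 and the incomplete g2 are both removed.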
Finally, it is common practice to split your dataset into training and testing subsets in order to ensure a trained model is accurate.
Ideally, we want to use approximately 75% of the dataset for training, and reserve the remaining 25% for testing.
These subsets should be taken at random to prevent the model from being biased to a particular set of information.
After a model is trained using the training dataset, it should be run against the test dataset to determine the model’s predictive performance and see if it is overfitting.
Luckily, sklearn has a built-in function that will create these subsets for us.
By feeding our X and y frames into train_test_split, we are able to retrieve both training and testing subsets with the expected splits.
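A minimal sketch of the split, assuming synthetic arrays in place of the real features and labels. The test_size and random_state values are spelled out only to make the example reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 20 "games" with two features and two labels each.
X = np.arange(40).reshape(20, 2)
y = X * 2

# Hold back 25% of the rows, chosen at random, for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape)
```

The rows of X and y are shuffled together, so every feature row stays paired with its own label row in both subsets.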
Creating and Training a Model
[Gist: Setting hyperparameters and training our model]
Now that our dataset has been processed, it’s time to create and train our model.
I decided to use a RandomForestRegressor for this example due to the algorithm’s ease of use and relative accuracy as well as its decent handling of reducing overfitting compared to standard decision trees.
The Random Forest algorithm creates several decision trees, injecting some randomness into the data samples and features that each tree considers.
These decision trees are then combined to create a forest (hence a random forest of decision trees) which is used for final analysis while training, validating, or inferring.
The algorithm supports both classification as well as regression, making it very flexible for diverse applications.
Classification determines output labels that belong to a fixed number of categories, such as the letter grade students received on a test (“A”, “B”, “C”, “D”, or “F”).
There can only be five categories (or classes), so the model will only attempt to place outputs into one of these five categories.
Regression, on the other hand, determines output labels that can take on an indefinite range of values, such as the price of a home.
Though there tends to be a range of standard home prices, there is no limit to the price a house could sell for, and any positive number is a valid possibility.
Since the final score of a basketball game can technically be any positive number (or zero!), we want to use regression.
Before we build and train our model, we first need to set some hyperparameters.
Hyperparameters are parameters that are input to a model prior to training and affect how it is built and optimized.
These parameters tend to be the biggest hurdle for most beginners in the fields of machine and deep learning as there generally isn’t a “perfect” value for these settings and it can get overwhelming to determine what should be put, if anything.
A general rule of thumb is to stick with the default values of these hyperparameters initially, then once a model is trained and completed and you are able to test it, begin to tweak the values using a trial-and-error method until you are satisfied with the final results.
For our model, I’ve chosen six different hyperparameters and found this particular set of values to provide the best trade-off between performance and accuracy.
More details on these specific settings can be found in the official scikit-learn documentation.
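Hyperparameters like these are commonly collected in a dictionary and handed to a constructor as keyword arguments. Here is a minimal sketch of that pattern, where build_model is a hypothetical stand-in for a model class and the values are placeholders rather than the settings chosen for this tutorial.

```python
def build_model(n_estimators=100, max_depth=None, min_samples_leaf=1):
    """Hypothetical stand-in for a model constructor.

    Returns the settings it received so they can be compared.
    """
    return {
        'n_estimators': n_estimators,
        'max_depth': max_depth,
        'min_samples_leaf': min_samples_leaf,
    }


# Placeholder hyperparameters stored as key-value pairs.
parameters = {'n_estimators': 50, 'max_depth': 6}

# ** expands the dictionary into named arguments...
expanded = build_model(**parameters)
# ...which is the same as writing each argument out by hand.
explicit = build_model(n_estimators=50, max_depth=6)

print(expanded == explicit)  # prints True
```

Keeping the settings in one dictionary makes it easy to tweak a value during trial and error without touching the call that builds the model.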
After selecting our hyperparameters, it’s finally time to create our model.
First, we need to instantiate the RandomForestRegressor class that we imported earlier and include our hyperparameters.
By using (**parameters), we expand the key-value pairs of our dictionary into named arguments to the class, which is functionally identical to the following:
[Gist: An example of how dictionary expansion works for function calls]
Now that our model has been instantiated, all that’s left is to train it.
sklearn makes this very easy by including the fit method with RandomForestRegressor, so we only need to run it with our input features and the corresponding output labels.
This method runs in-place, so our model variable will now automatically point to a trained model that we can use for predictions!
Printing Results
[Gist: Print the final results]
The final step of our application is to run predictions against our testing subset and compare them with our expected results.
This print statement outputs both the predicted results as well as our actual expected results as two different two-column vectors.
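Putting the last few steps together, here is a minimal sketch of creating, training, and using the model. It assumes synthetic data in place of the real boxscore features, and the hyperparameter values are placeholders chosen for the demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 200 "games" with four features, and two label
# columns playing the role of home and away points.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = np.column_stack([60 + 30 * X[:, 0], 60 + 30 * X[:, 1]]).round()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Placeholder hyperparameters, expanded into named arguments.
parameters = {'n_estimators': 50, 'max_depth': 6, 'random_state': 0}
model = RandomForestRegressor(**parameters)

# fit trains the model in-place on the training subset.
model.fit(X_train, y_train)

# Predict the held-out games and print them next to the actual labels.
predictions = model.predict(X_test).astype(int)
print(predictions[:5], y_test[:5].astype(int))
```

Both printed arrays have two columns, so each predicted row can be compared directly with the actual row at the same position.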
Running the Application
Finally, the moment we’ve all been waiting for!
Our application is now complete and all we have left is to run the algorithm.
I named my program ncaab-machine-learning-basic.py, so I simply need to run the following to initiate the algorithm:
python ncaab-machine-learning-basic.py
Please note that the program may take a long time to complete as the bulk of the processing time is spent building the dataset for all 350+ teams in Division-I College Basketball.
If you just desire to see a working algorithm, you can stop the data creation early by adding a break statement in the first loop after the data concatenation line.
Once the program finishes, it will output something similar to the following (I reduced the number of lines to save space):
(array([[86, 86], [71, 71], [78, 77], [74, 72], [90, 81], ... [52, 66], [68, 65]]),
array([[83, 89], [71, 73], [80, 76], [77, 72], [92, 84], ... [46, 73], [66, 65]]))
This output contains two sections: the predicted output followed by the expected output.
Everything from array([[86, 86] to [68, 65]]) is the predicted output while array([[83, 89] to [66, 65]]) is the actual data.
As was specified earlier, the first column refers to the expected number of points the home team will score, and the second column is the projected points for the away team.
The rows in the predicted output also match up with the rows in the expected output, so [86, 86] corresponds to [83, 89] and so on.
If we compare down the list, we will find that our predictions aren’t too bad!
For the most part, the projected score is only a few points away from the actual result.
Another promising sign is when the actual score varies from a typical result of around 70 points, our algorithm is able to identify a difference and generate a score that is higher or lower than what is considered normal.
Improving the Application
If this is your first machine learning program, congratulations!
Hopefully this tutorial is enough to get you started and show that a basic machine learning application doesn’t require years of education or thousands of lines of code.
While this program is a great start, there are many ways we can extend it to make it better.
Here are several improvements I would make to the application to improve performance, accuracy, and usability:
Save the dataset to local directory: As mentioned earlier, the program takes a long time to complete as it builds the dataset from scratch for all 350+ teams.
Currently, if you want to run the algorithm again, you will need to build the dataset all over again, even if it wouldn’t have changed.
This process can be short-circuited after the first time the dataset is built by saving a copy of the DataFrame to the local filesystem by converting it to a CSV or Pickle file.
Then, the next time the program is run, you can optionally test if a CSV or Pickle file is stored locally and, if so, load it from the file and skip building the dataset.
This will dramatically reduce the time required to run the program after the dataset is first saved.
Feature engineering: A common practice in improving machine learning models is known as feature engineering.
This pertains to the process of creating or modifying features which help the model find correlations between various categories.
Feature engineering is often a difficult task as, like with hyperparameters, there isn’t a defined method that you can use which will consistently improve performance.
However, one rule of thumb is to modify numerical features so they are in the same order of magnitude.
For example, our dataset contains many percentages as well as cumulative totals.
The percentages range from 0–1 while the totals can be any number greater than or equal to 0 (think “points” or “minutes played”).
Modifying these features so they are in the same order of magnitude can aid the creation of the model.
Another example of feature engineering is to create a new feature, such as the famous four factor rating for each team.
We can generate this new feature and include it with our dataset to determine whether or not it improves the overall model.
Display predictions for specific teams: While our program is a great introduction to machine learning and predicting basketball scores, it isn’t fully usable for determining outcomes of specific games or matchups.
A great extension would be to generate predictions for specific teams.
This way, we can answer questions like “Indiana is playing at Purdue.
What’s the score predicted to be?” As we did when we built our dataset, we can leverage sportsreference to generate data specific to individual teams and use that for our input while making predictions.
Write functions for similar blocks of code: To be more Pythonic and make our application modular for future changes, functions should be used for all blocks of code that have a specific purpose, such as building the dataset, processing our data, and building and training our model.
This also improves the readability of code to aid others who might use it in the future.
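Two of the suggestions above, caching the dataset and rescaling features, can be sketched as small functions. This is only an illustration under assumptions: load_or_build_dataset, scale_features, and fake_build are hypothetical names, the file path is a temporary one, and the tiny frame stands in for the real dataset.

```python
import os
import tempfile

import pandas as pd


def load_or_build_dataset(path, build_dataset):
    """Load a cached DataFrame from path, or build it once and cache it."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    dataset = build_dataset()
    dataset.to_pickle(path)
    return dataset


def scale_features(features):
    """Min-max scale every column into the 0-1 range."""
    # Replace zero-width ranges with 1 to avoid dividing by zero.
    spans = (features.max() - features.min()).replace(0, 1)
    return (features - features.min()) / spans


calls = []


def fake_build():
    """Stand-in for the slow dataset download; records each call."""
    calls.append(1)
    return pd.DataFrame({'points': [60, 90], 'fg_pct': [0.4, 0.5]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'dataset.pkl')
    first = load_or_build_dataset(path, fake_build)   # builds and saves
    second = load_or_build_dataset(path, fake_build)  # loads from disk

scaled = scale_features(first)
print(len(calls), scaled)
```

The second call never reaches fake_build because the pickle file already exists, and after scaling both the totals and the percentages sit in the same 0 to 1 range.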
Now that you have a working application, try to implement some of these suggestions to improve the accuracy and performance of your model.
If you generate a model you are satisfied with, you can use it to create predictions for the NCAA Tournament or possibly enter a competition.
While it will still be tough to beat the Golden Retriever or Sally’s pet rock, this algorithm just might give you that competitive edge in your company’s pool this year.
Why not make this March yours and dethrone Jim from accounting?