Over the last few weeks, we’ve been learning about SQL databases, classification models such as Logistic Regression and Support Vector Machines, and visualization tools such as Tableau, Bokeh, and Flask.
I put these new skills to use over the past 2 weeks in my project to classify injured pitchers.
This post will outline my process and analysis for this project.
All of my code and project presentation slides can be found on my GitHub, and my Flask app for this project can be found on AWS.
Challenge:
For this project, my challenge was to predict MLB pitcher injuries using binary classification.
To do this, I gathered data from several sites, including Baseball-Reference.com and MLB.com for pitching stats by season, Spotrac.com for Disabled List data per season, and Kaggle for 2015–2018 pitch-by-pitch data.
My goal was to use aggregated data from previous seasons to predict whether a pitcher would be injured in the following season.
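The labeling step implied by this goal can be sketched roughly as follows. This is only an illustration, not the project's actual code: the row layout, the field names (`pitcher`, `season`, `injured_next_season`), and the toy values are all assumptions. Each pitcher-season row receives a label saying whether that pitcher appears in the injury data for the following season, and rows whose following season has no injury data yet are dropped rather than labeled healthy.

```python
def label_next_season(stats_rows, injured_pairs, last_injury_season):
    """Attach a next-season injury label to each pitcher-season row.

    stats_rows: list of dicts with (hypothetical) keys 'pitcher' and 'season'.
    injured_pairs: set of (pitcher, season) tuples for seasons with an injury.
    last_injury_season: latest season for which injury data is available.
    """
    labeled = []
    for row in stats_rows:
        target_season = row["season"] + 1
        if target_season > last_injury_season:
            continue  # no injury data yet for the following season
        is_injured = (row["pitcher"], target_season) in injured_pairs
        labeled.append({
            **row,
            "target_season": target_season,
            "injured_next_season": int(is_injured),
        })
    return labeled


# Toy rows for illustration only, not real pitcher data.
stats = [
    {"pitcher": "A", "season": 2015, "innings": 180.0},
    {"pitcher": "A", "season": 2016, "innings": 95.0},
    {"pitcher": "B", "season": 2015, "innings": 60.0},
    {"pitcher": "B", "season": 2018, "innings": 70.0},
]
injured = {("A", 2016)}
labeled = label_next_season(stats, injured, last_injury_season=2018)
```

Building the label from the following season keeps the features strictly earlier in time than the outcome they are meant to predict.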
The requirements for this project were to store our data in a PostgreSQL database, to utilize classification models, and to visualize our data in a Flask app or create graphs in Tableau, Bokeh, or Plotly.
Data Exploration:
I gathered data from the 2013–2018 seasons for over 1500 Major League Baseball pitchers.
To get a feel for my data, I started by looking at the features that seemed most intuitively predictive of injury and compared them across the injured and healthy subsets.
I first looked at age, and while the mean age of both injured and healthy players was around 27, the age distributions were skewed differently in the two groups.
The most common age in injured players was 29, while healthy players had a much lower mode at 25.
Similarly, average pitching speed in injured players was higher than in healthy players, as expected.
The next feature I considered was Tommy John surgery.
This is a very common surgery among pitchers, in which a torn ligament in the elbow (the ulnar collateral ligament) is replaced with a healthy tendon taken from elsewhere in the body, typically the arm or leg.
I assumed that pitchers with past surgeries were more likely to get injured again, and the data supported this idea.
About 30% of injured pitchers had a past Tommy John surgery, compared with about 17% of healthy pitchers.
I then looked at average win-loss record in the two groups, which surprisingly was the feature with the highest correlation to injury in my dataset.
Injured pitchers won an average of 43% of their games, compared to 36% for healthy pitchers.
It makes sense that pitchers with more wins will get more playing time, which can lead to more injuries, as shown in the higher average innings pitched per game in injured players.
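The kind of feature-to-injury correlation discussed here can be computed as a plain Pearson correlation between a numeric feature and a 0/1 injury flag. The sketch below is a minimal, dependency-free illustration; the `pearson` helper and the toy values are assumptions and do not reproduce the project's data or results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only: win percentage and a 0/1 injury flag.
win_pct = [0.55, 0.40, 0.30, 0.60, 0.35, 0.45]
injured = [1, 0, 0, 1, 0, 1]
r = pearson(win_pct, injured)  # positive here: injured rows have more wins
```

A positive value only says the feature tends to be higher in the injured group; it does not by itself show that winning causes injuries.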
The feature I was most interested in exploring for this project was a pitcher’s repertoire and if certain pitches are more predictive of injury.
Looking at feature correlations, I found that Sinker and Cutter pitches had the highest positive correlation to injury.
I decided to explore these pitches more in depth and looked at the percentage of combined Sinker and Cutter pitches thrown by individual pitchers each year.
I noticed a pattern of injuries occurring in years where the sinker/cutter pitch percentages were at their highest.
Below is a sample plot of four leading MLB pitchers with recent injuries.
The red points on the plots represent years in which the players were injured.
You can see that they often correspond with years in which the sinker/cutter percentages were at a peak for each of the pitchers.
Another trend I noticed from looking at these plots for several pitchers was that injuries often occurred in the same year as, or the year following, the initial introduction of sinkers/cutters to a pitcher’s repertoire.
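The combined sinker/cutter percentage per pitcher and season described above could be derived from pitch-by-pitch rows along these lines. This is a sketch under assumptions: the tuple layout is hypothetical, and the pitch-type codes `SI` (sinker) and `FC` (cutter) are assumed to match the dataset's coding.

```python
from collections import Counter

# Assumed pitch-type codes for sinkers and cutters.
SINKER_CUTTER = {"SI", "FC"}


def sinker_cutter_share(pitches):
    """Return {(pitcher, season): share of pitches that were sinkers or cutters}.

    pitches: iterable of (pitcher, season, pitch_type) tuples (hypothetical layout).
    """
    totals = Counter()
    matches = Counter()
    for pitcher, season, pitch_type in pitches:
        key = (pitcher, season)
        totals[key] += 1
        if pitch_type in SINKER_CUTTER:
            matches[key] += 1
    return {key: matches[key] / totals[key] for key in totals}


# Toy pitch log for illustration only.
pitch_log = [
    ("A", 2016, "FF"), ("A", 2016, "SI"), ("A", 2016, "FC"), ("A", 2016, "SL"),
    ("A", 2017, "FF"), ("A", 2017, "CH"),
]
shares = sinker_cutter_share(pitch_log)
```

Plotting these shares by season for one pitcher, with injured seasons highlighted, gives the kind of per-pitcher view described in this section.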
Modeling:
My next step was to throw all of my features into a few classification models.
With injured pitchers only accounting for about 28% of my dataset, I first had to deal with my class imbalance.
I turned to Random Oversampling to balance my classes for input into Logistic Regression, KNN, Linear SVM, and Random Forest models.
I also tested synthetic oversampling with SMOTE and ADASYN in Linear SVM models, which yielded slightly worse results.
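Random oversampling, as mentioned above, simply duplicates randomly chosen minority-class rows until the classes are the same size. The post does not say which library was used (imbalanced-learn's `RandomOverSampler` is a common choice), so the following is a dependency-free sketch of the idea with a hypothetical function name and toy data.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap up to the largest class size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy training set for illustration only: 7 healthy (0) and 3 injured (1) rows.
train_rows = [[i] for i in range(10)]
train_labels = [0] * 7 + [1] * 3
balanced_rows, balanced_labels = random_oversample(train_rows, train_labels)
```

Oversampling is normally applied to the training portion only, so that duplicated rows do not leak into the validation or holdout data.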
I trained my models on injuries in the 2015–2017 seasons and compared their scores on a validation subset of this training data.
My models scored as follows:
As you can see from the table above, the Linear SVM with Random Oversampling scored the highest.
I used the area under the ROC curve as my model scoring metric, since it summarizes the trade-off between the true positive rate and the false positive rate across all classification thresholds.
Below is a plot of the ROC curves for each of the models I tested.
An ideal ROC curve would touch the top left corner of the graph, so we are looking for the model that comes the closest to that corner and has a consistently increasing true positive rate.
In this case, the bold blue line representing the Linear SVM model scored the best for my dataset at most thresholds on the ROC curve.
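For intuition about the metric, the area under the ROC curve equals the probability that a randomly chosen injured pitcher receives a higher model score than a randomly chosen healthy one, with ties counted as half. The sketch below computes it that way in plain Python; in practice a library function such as scikit-learn's `roc_auc_score` would be the usual choice, and the scores here are toy values rather than output from the project's models.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy labels and scores for illustration only.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

A score of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which is the scale on which the model scores in this post should be read.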
I selected the Linear SVM model, retrained it on my combined training and validation data, and finally scored it on my holdout data, which was the 2018 season.
My model scored a 0.7072 area under the ROC curve, which I was very happy with!
My model provided a significant improvement over randomly guessing injuries and gave me some insight into controllable features (such as sinker/cutter pitching percentages) that a coach could conceivably look at to fend off injury in high-risk pitchers.
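The season-based split described in this section (2015–2017 injuries for training, 2018 held out) can be sketched as a simple filter on the season being predicted. The `target_season` field name and the toy rows are assumptions for illustration, not the project's actual schema.

```python
def split_by_season(rows, train_seasons, holdout_season):
    """Split labeled rows by the season being predicted (hypothetical 'target_season' key)."""
    train = [r for r in rows if r["target_season"] in train_seasons]
    holdout = [r for r in rows if r["target_season"] == holdout_season]
    return train, holdout


# Toy labeled rows for illustration only.
rows = [
    {"pitcher": "A", "target_season": 2015, "injured": 0},
    {"pitcher": "A", "target_season": 2016, "injured": 1},
    {"pitcher": "B", "target_season": 2017, "injured": 0},
    {"pitcher": "B", "target_season": 2018, "injured": 1},
]
train_rows, holdout_rows = split_by_season(rows, {2015, 2016, 2017}, 2018)
```

Holding out the most recent season, rather than a random sample of rows, mirrors how the model would actually be used: trained on past seasons and scored on one it has never seen.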
Flask App:
Main Search Page of App
To display all of the findings above, I created a Flask app that includes player pages for all of the pitchers in my database.
On the main page, you can type in your favorite MLB pitcher (as long as they have played more than 5 innings between the 2013 and 2018 seasons) and you will be directed to their player page which includes pitching statistics, a pitch percentage breakdown by season, and a 2019 injury probability prediction from my model.
Sample Player Page in App
Future Work:
Going into this bootcamp, I was really hoping that I would have the opportunity to work on a baseball-related project, and I had so much fun building this model and app.
In the future, I would love to build upon this project by narrowing down my predictions to a game-by-game level instead of season-by-season.
I also hope to expand my app to be more interactive and allow the user to play with different pitch percentages to see how the injury probability changes.
I am loving the opportunity to expand my knowledge through these projects.
I am also realizing that my struggle to throw a sinker throughout all of my years as a pitcher might have been a blessing in disguise.