Can you accurately predict MLB games based on home and away records?

Take out payroll, batting averages, ERAs, and any other sabermetric you can think of.
Let’s go old school for a minute: wins and losses.
Payton Soicher, May 12

Let’s say you’re a professional sports gambler watching two teams play each other on a September day: the Sea Monsters vs. the Iron Tanks.
The Sea Monsters are 87–63 (.580 winning pct), while the Tanks have a 40–90 (.267 winning pct) record.
On paper, it seems like you would put your money on the Sea Monsters to run over the Tanks, but there’s a catch.
The Iron Tanks are nearly invincible at home with a 37–3 (.940 winning pct) home record, and the Sea Monsters struggle on the road with a 20–40 (.333 winning pct) record.
If you put money down on the Sea Monsters because you believe they’re the better team overall, how risky is your bet?

The answer to this problem can be found with Bayes’ Theorem.
Bayes’ Theorem is used to calculate conditional probabilities: the probability of an event given that you already know some underlying information.
For this example, you can make a more informed decision about the outcome of the game because you know each team’s overall likelihood of winning as well as where the game is being played.
In this article, I tackle these questions:

Using the last 4 years of MLB data, how accurate are the predictions made using Bayes’ Theorem?
When in the season is there enough information for the predictions to be accurate?
Is there a way to hedge which bets should be made and which should not?
How does it compare to machine learning algorithms?

Bayes’ Theorem

For a quick recap of how Bayes’ Theorem is calculated, recall the formula: P(A|B) = P(B|A) × P(A) / P(B).
In this case, we can change P(A|B) to be P(W|H), which would mean the probability of winning a game, given that we know that the team is at home.
We can then also use P(W|A) as the probability of winning a game, given the team is away.
Now, let’s plug in the information for the Sea Monsters and the Iron Tanks:

P(W) = Probability of Winning, P(L) = Probability of Losing, P(W|H) = Probability of Winning given the team is at home, P(W|A) = Probability of Winning given the team is away

From this analysis, we can see that the Iron Tanks have an 85.26% chance of winning at home given their records, and the Sea Monsters have a 40.81% chance of winning on the road given their records.
Since these two probabilities don’t add up to 100%, we can add them together and take each team’s share of the total as its probability of winning the matchup.
Taking the partial winning percentages of both teams can give us a better head-to-head matchup prediction

We can see that for this game, even though the Iron Tanks have a much worse record than the Sea Monsters, they have a Bayesian probability of 67.63% to win the matchup.
Being the professional sports gambler that you are, you can take this information, check it against the money line of the game, and determine whether the risk of betting on this game with a spread of about 35% (67.63% - 32.37% = 35.26%) is worth the reward if you correctly predict the outcome.
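The calculation can be sketched in Python. The article’s formula image did not survive, so the update rule below is an assumption: it multiplies a team’s overall winning percentage by its home or away winning percentage and normalizes against the corresponding losing case. With the stated percentages it reproduces the Sea Monsters’ 40.81% and gives roughly 85.1% for the Iron Tanks, close to but not exactly the 85.26% quoted. The function names are illustrative, not the author’s.

```python
def location_adjusted_prob(overall_pct, location_pct):
    """Combine an overall win rate with a home/away win rate.

    Assumed form (not confirmed by the article):
    win-case product divided by (win-case product + loss-case product).
    """
    win = overall_pct * location_pct
    loss = (1 - overall_pct) * (1 - location_pct)
    return win / (win + loss)


def matchup_share(home_prob, away_prob):
    """Home team's share of the two probabilities, so the pair sums to 1."""
    return home_prob / (home_prob + away_prob)


# Winning percentages as stated in the example.
tanks_home = location_adjusted_prob(0.267, 0.940)     # roughly 0.851
monsters_away = location_adjusted_prob(0.580, 0.333)  # roughly 0.408

# Head-to-head share and spread from the computed values.
tanks_share = matchup_share(tanks_home, monsters_away)
spread = tanks_share - (1 - tanks_share)

# Plugging in the article's quoted 85.26% and 40.81% gives its 67.63%.
quoted_share = matchup_share(0.8526, 0.4081)

print(f"Tanks at home:    {tanks_home:.2%}")
print(f"Monsters on road: {monsters_away:.2%}")
print(f"Tanks share:      {tanks_share:.2%} (spread {spread:.2%})")
print(f"Share from quoted figures: {quoted_share:.2%}")
```

The normalization step is the part the text describes directly: the two location-adjusted probabilities are added together and each team takes its share of the total.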
Major League Baseball Case Study

Since each game can reference a team’s overall record plus its home or away record, you would expect the accuracy of the predictions to improve as a season goes on.
Using the final winning percentages of each team, we can see which team we predict to win and whether the prediction matched the outcome.
First, let’s see whether the accuracy converges to a specific value as time moves on:

This graph did not look the way I expected.
I was correct in assuming that the variance of the accuracy would be large at the beginning of the season and small at the end. So little is known about the teams early on, while by the end of the season there is enough information that new results no longer move the accuracy significantly.
However, I thought that with the additional information, the predictions would get better with time, which didn’t really seem to be the case.
From May through October, accuracy never dropped below 50% and never reached 60%; it hovered around 55%.
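A minimal sketch of how such an accuracy curve can be built, assuming the plotted value is the cumulative share of correct predictions to date (the article does not say exactly how the curve was computed). The list of outcomes is made up for illustration.

```python
from itertools import accumulate


def running_accuracy(correct_flags):
    """Cumulative share of correct predictions after each game."""
    hits = accumulate(int(flag) for flag in correct_flags)
    return [h / n for n, h in enumerate(hits, start=1)]


# Hypothetical outcomes in chronological order: True = prediction was right.
flags = [True, False, True, True, False, True, False, True]
curve = running_accuracy(flags)

for game_number, acc in enumerate(curve, start=1):
    print(f"after game {game_number}: {acc:.3f}")
```

Each early game moves the curve a lot (from 1.000 to 0.500 after the second game here), while each later game moves it less, which is why the variance shrinks as the season goes on even if the underlying accuracy does not improve.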
Although it isn’t a staggering accuracy, being roughly 55% correct on all games throughout the year is pretty good! It beats blindly flipping a coin to determine the outcome. The next question is how bad the predictions were, on both the hits and the misses.

This is where I thought the analysis was most interesting.
Looking at the difference between the home team’s winning % and the away team’s winning %, the correct and incorrect predictions did not differ much: the gaps were roughly even between the two.
For example, if many of the incorrectly predicted games had a gap of about 10 points or less between the two teams’ winning percentages (55% to 45%), that would be understandable.
But if the model were missing games with a large difference in winning percentages (85% to 15%), that would not be a good sign.
As you can see from the graph, regardless of the prediction result, the successes and misses roughly had the same gap in probability spreads.
Here’s another way we could look at this.
This is a box and whisker plot of the probability spread of correct and incorrect predictions.
In an ideal scenario, the correct predictions will have large spreads while the incorrect classifications will have small spreads.
This first plot shows games early in the season.
You can see that a lot of games have huge spread gaps, with some of them being incorrectly classified and others being correctly classified.
Spread Probabilities Box and Whisker Plots of False and True Predictions

However, if we look at games in September, the spreads of correct and incorrect predictions aren’t significantly different, and there are still a few very bad predictions (look at the false predictions in 2018).
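The comparison behind those plots can be sketched with the five numbers a box and whisker plot draws. This is a minimal illustration using Python’s standard `statistics` module; the spreads below are hypothetical values, not results from the MLB data.

```python
import statistics


def spread_summary(spreads):
    """Five-number summary: the values a box and whisker plot is built from."""
    q1, median, q3 = statistics.quantiles(spreads, n=4)
    return {
        "min": min(spreads),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(spreads),
    }


# Hypothetical (probability spread, prediction was correct) pairs.
games = [
    (0.05, True), (0.30, False), (0.12, True), (0.22, True),
    (0.08, False), (0.18, False), (0.35, True), (0.15, False),
]

correct = [spread for spread, ok in games if ok]
incorrect = [spread for spread, ok in games if not ok]

summary_correct = spread_summary(correct)
summary_incorrect = spread_summary(incorrect)

print("correct:  ", summary_correct)
print("incorrect:", summary_incorrect)
```

If large spreads really signaled safer bets, the summary for correct predictions would sit clearly above the one for incorrect predictions. Two summaries that look alike, as in this toy data, match what the plots show.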
Tying in Machine Learning

Most people at this point are probably thinking “OK Payton, this is interesting and all, but I’m pretty sure a machine learning algorithm would do much better than your Bayesian calculations.”
Well…let’s see how they match up.
Default models for Random Forest, Logistic Regression, K Nearest Neighbors, and Support Vector Machine classifiers

As a reminder, our Bayesian model had an accuracy of roughly 55% in each year. None of the machine learning models’ testing accuracies even reached that level.
This is even more striking because I used the Bayesian results as inputs to the models! Even with the help of outputs from a model that is somewhat successful, they didn’t do better.
I use the testing accuracies because that would be the equivalent of looking at new prediction data.
Machine learning models tend to do much better on training sets because they can fit those data points as closely as possible, but on new data they tend not to be as successful.
Even though the Logistic Regression had the worst training accuracy, it had the best testing accuracy, which makes sense.
This data set doesn’t have the kind of in-depth features these machine learning models thrive on, so a simple Bayesian model can outperform them.
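To see why testing accuracy is the fair comparison, here is a minimal sketch on synthetic data. It uses a hand-written 1-nearest-neighbour classifier as a stand-in; it is not one of the article’s models, and the generated games are not the MLB data set. The single feature is a made-up home-team probability, and the outcome is mostly noise around it.

```python
import random


def make_games(n, rng):
    """Synthetic games: (home-team probability, home team won)."""
    games = []
    for _ in range(n):
        share = rng.uniform(0.3, 0.7)
        home_won = rng.random() < share  # outcome is mostly noise
        games.append((share, home_won))
    return games


def nearest_label(train, x):
    """Label of the training game whose feature is closest to x."""
    return min(train, key=lambda game: abs(game[0] - x))[1]


def accuracy(train, data):
    """Share of games in `data` that the 1-NN rule labels correctly."""
    hits = sum(nearest_label(train, x) == y for x, y in data)
    return hits / len(data)


rng = random.Random(0)
train = make_games(400, rng)
test = make_games(200, rng)

train_acc = accuracy(train, train)  # each game is its own nearest neighbour
test_acc = accuracy(train, test)    # new games the model has never seen

# Simple rule for comparison: pick the home team when its probability >= 0.5.
baseline_acc = sum((x >= 0.5) == y for x, y in test) / len(test)

print(f"1-NN training accuracy: {train_acc:.3f}")
print(f"1-NN testing accuracy:  {test_acc:.3f}")
print(f"simple rule on test:    {baseline_acc:.3f}")
```

The nearest-neighbour model memorizes the training games, so its training accuracy is perfect while its testing accuracy is much lower. On noisy data like this, the simple rule will typically hold its own against the more flexible model.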
Conclusion

Bayesian statistics can give you an upper hand in making predictions, but a word of caution is in order.
As we saw with the accuracy of predictions throughout the year, you can expect to be right on roughly 55% of predictions from May through the rest of the year, which, in sports betting terms, can be enough to come out on top over the long run.
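Whether 55% is enough depends on the price of the bet. The sketch below converts an American moneyline into its break-even win probability and computes the expected profit per bet. The -110 line is a hypothetical price chosen for illustration; real MLB moneylines differ from game to game, and the function names are my own.

```python
def implied_probability(moneyline):
    """Break-even win probability for an American moneyline."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)


def expected_profit(win_prob, moneyline, stake=100):
    """Average profit per bet of `stake` at the given win probability."""
    if moneyline < 0:
        payout = stake * 100 / -moneyline
    else:
        payout = stake * moneyline / 100
    return win_prob * payout - (1 - win_prob) * stake


break_even = implied_probability(-110)    # about 0.524
profit_at_55 = expected_profit(0.55, -110)  # about 5 per 100 staked

print(f"break-even at -110: {break_even:.2%}")
print(f"expected profit at 55% accuracy: {profit_at_55:.2f} per 100 staked")
```

At that hypothetical price, a 55% hit rate clears the break-even point of about 52.4%. Against a heavy favorite priced well below -110, the same hit rate would not.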
However, you shouldn’t bet more on larger spreads than on smaller ones, since the spreads of correct and incorrect predictions have roughly the same distribution.
This logic of analysis can be used for all types of sports.
Football would be tricky to do since there is such a small sample size of home and away games, but hockey and basketball both would have enough games over the years to make useful predictions.
One last takeaway: when you’re using data as simple as wins and losses, a simple model like the Bayesian prediction model might be a smarter choice than a more sophisticated machine learning model.
Machine learning models thrive on high volumes of data as well as in-depth data points that have a significant influence on outcomes.
If you had a dataset with more data points (who the pitcher was, the weather, winning streaks, head-to-head history, etc.), machine learning models would likely outperform the Bayesian model.
But if you’re trying to come up with something simple that will produce more right than wrong predictions, a model based on Bayes’ Theorem is a great route to take.