To check whether these two variables have a high degree of correlation, we’ll plot them against one another and check the value of R.

The plot below makes intuitive sense.

As teams play opponents with lower all time winning percentages, their head to head winning percentage goes up.

However, this isn’t a perfect correlation.

There is a wide spread of points once we ignore the four teams with an all time win percentage greater than 50%.

As such, we shouldn’t be worried about collinearity in this case and can feel comfortable including both features in our model.
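The check itself is a one-liner in pandas. Below is a minimal sketch; the column names and the toy values are illustrative stand-ins, not the project's actual data.

```python
import pandas as pd

# Hypothetical feature frame; column names are illustrative, not the
# actual ones used in the project.
df = pd.DataFrame({
    "opp_all_time_win_pct": [0.55, 0.48, 0.60, 0.35, 0.42, 0.30],
    "h2h_win_pct":          [0.30, 0.45, 0.25, 0.65, 0.50, 0.70],
})

# Pearson's R between the two candidate features
r = df["opp_all_time_win_pct"].corr(df["h2h_win_pct"])
print(round(r, 2))
```

On the toy values above this yields a strongly negative r; on the real data the same call produces the value reported below.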

R value of -0.51

Win Streak

This is a feature that intends to incorporate mentality and team confidence into the model.

We might expect that when a team wins a game, their odds of winning the next game are improved due to the team being in ‘good form’.

Of course, this could also be pure observation or confirmation bias.

In particular, it might be related to the NBA hot hand hypothesis that has been consistently disproved.

Implementing this feature should not turn out to be too difficult.

In much the same way as above, we’ll have to iterate over a list of all the teams, pull out a slice of all their games, update a ‘Streak’ column, and concatenate all the updated slices together.

Using an inner for loop and the dataframe method .iterrows() will allow us to initialize a streak counter, check our team’s result, append that value to a list, and update the counter.

After iterating over every row in our slice, we’ll have a list that contains the win streak for our team of interest (resets to 0 if the team loses or draws).

This list can be translated directly to the ‘Streak’ column as it is the same length.
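The loop described above can be sketched as follows. The toy dataframe and its column names (`Team`, `Result`) are assumptions for illustration; the real project applies this to each team's slice of the full results data.

```python
import pandas as pd

# Toy match data; 'Result' is from the team's perspective:
# 'W' (win), 'D' (draw), 'L' (loss).
games = pd.DataFrame({
    "Team":   ["Arsenal"] * 5,
    "Result": ["W", "W", "L", "W", "D"],
})

def add_streak(team_slice):
    """Append a 'Streak' column: consecutive wins entering each game."""
    streaks, counter = [], 0
    for _, row in team_slice.iterrows():
        streaks.append(counter)               # streak *before* this game
        counter = counter + 1 if row["Result"] == "W" else 0  # reset on L/D
    team_slice = team_slice.copy()
    team_slice["Streak"] = streaks            # same length as the slice
    return team_slice

games = add_streak(games)
print(games["Streak"].tolist())  # [0, 1, 2, 0, 1]
```

Recording the streak *entering* each game (rather than after it) keeps the feature from leaking the game's own result into the prediction.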

Let’s also do a quick check to see if this feature will have any amount of predictive strength.

In the plots below I’ve looked at 8 teams and their respective win percentages during a win streak and all time.

The first plot shows 4 of the most successful teams in Premier League history and the second plot shows 4 of the less successful teams.

What we notice is that the number of wins in a streak likely doesn’t have any bearing on win percentage.

That is, a team will not win more games after a 3 win streak than after a 5 win streak.

However, most teams win at a higher rate during a win streak, as compared to their all-time percentage.

This feature looks to be somewhat predictive.

Opponent’s League Finish Last Year

Over the course of 38 games (1 Premier League season), where each team finishes in the league should be fairly indicative of their strength.

This may change significantly over several seasons but is unlikely to change much one season to the next.

As such, if we have information on where a team finished last year, we might expect to gain some predictive power from the information.

Unfortunately, our original data source doesn’t have information on league standings so we’ll have to go searching for more data.

Datahub.io has just the information we need – free to download.

Due to some discrepancies in notation and season availability, getting the information from this source and into a feature for our model involved a fair bit of data munging.

Team names didn’t match, years were annotated differently, and so on.

Luckily Python’s built in string module made these changes quite simple.

If you’re interested in all the manipulations that took place, check the GitHub.
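To give a flavor of the munging, here is a hedged sketch of the kind of normalization involved. The specific aliases and season formats shown are hypothetical examples; the real mappings live in the repo.

```python
# Illustrative name/season normalization only; the actual mismatches
# and mappings are in the project's GitHub repo.
def normalize_team(name: str) -> str:
    # e.g. strip a trailing " FC" and map long names to short ones
    name = name.replace(" FC", "").strip()
    aliases = {"Manchester United": "Man United",
               "Tottenham Hotspur": "Tottenham"}
    return aliases.get(name, name)

def normalize_season(label: str) -> str:
    # "2016/17" and "2016-17" both become "2016-17"
    return label.replace("/", "-")

print(normalize_team("Manchester United FC"))  # Man United
print(normalize_season("2016/17"))             # 2016-17
```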

The league standings data also included other potentially useful information such as goals per game, shots on target per game and other statistics.

To improve model performance these could always be added as features.

For now, we’ll set them to the side and proceed to model building!

3. Train/Test/Tune Classifiers for High Precision

Multinomial Logistic Regression

While not always quite as intriguing as some of the newer and more advanced supervised learning algorithms, regression models still have a lot to offer.

In the context of classifying matches into one of multiple discrete targets (win, draw, or loss), logistic regression is a great place to start.

At the very least, it will give us a baseline to compare with more complicated models.

Because our use case is trying to predict a single season’s worth of matches, the testing and training sets will be manually assigned.

The training set will be every season from 1994–95 through 2015–16.

The testing set will be a single season, the 2016–17 Premier League Season.

Again, the 5 features in the model are head to head win ratio (vs given opponent), home or away, opponent’s league finish last year, opponent’s all time win ratio, and win streak.

The target is a variable with three classes, win, loss, or draw.
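The manual split is straightforward if the data carries a season label. The `Season` column name and the tiny frame below are assumptions for illustration.

```python
import pandas as pd

# Minimal stand-in for the full feature matrix; a 'Season' column
# (assumed here) drives the manual train/test split.
df = pd.DataFrame({
    "Season": ["1994-95", "2010-11", "2015-16", "2016-17", "2016-17"],
    "Target": ["W", "L", "D", "W", "L"],
})

train = df[df["Season"] != "2016-17"]   # 1994-95 through 2015-16
test  = df[df["Season"] == "2016-17"]   # the held-out season

print(len(train), len(test))  # 3 2
```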

Fitting the model to the training set and running a grid search cross validation for an optimized C (= 0.1) results in a model score of 0.595.
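A sketch of that fitting and tuning step is below. The features and targets here are random synthetic stand-ins (so the scores are meaningless); the real model uses the five engineered features and, per the search above, lands on C = 0.1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

# Synthetic stand-ins for the five features and three-class target
X_train, X_test = rng.normal(size=(400, 5)), rng.normal(size=(100, 5))
y_train = rng.choice(["W", "D", "L"], size=400)
y_test  = rng.choice(["W", "D", "L"], size=100)

# Grid search cross validation over the regularization strength C
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X_train, y_train)

print(grid.best_params_)                                 # optimized C
print(classification_report(y_test, grid.predict(X_test)))
```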

Because this isn’t very elucidating, we can instead look at the classification report for more in depth analysis.

Classification Report for Multinomial Logistic Regression

As we can see in the report above, wins and losses have fair precision and recall, but draws seem to be much harder to predict.

When thinking about the overarching goal of this project, we have to think about what element of classification is most important.

Because we would like to use our model to place bets, we want to be relatively certain that when the model predicts a result, it is correct.

Therefore, to achieve the best performance, we need to optimize for high precision and can largely set recall aside.

This being said, our first model has fairly poor precision.

Perhaps if we bet in volume we could make money on wins and losses, but an improvement in precision would help dramatically.

As such, a Random Forest Classifier would seem a good next step.

Without going into too much detail, however, the Random Forest Classifier performed worse than the Logistic Regression model.

The increased complexity was not worth the drop in performance.

A Support Vector Machine was also tried, with similar results.

As it performed the best, we’ll stick with logistic regression and see if we can’t improve precision another way.

Binary Logistic Regression

Looking back at the classification report from the multinomial logistic regression model, we notice that draws are comparatively difficult to predict.

If we can merge two classes into one, reframe the model as a binary classifier, and increase precision in the process, we can up our potential profits.

Merging the draw category into either losses or wins is simple enough; we just have to decide which results we ultimately want to bet on.

Rather arbitrarily, we’ll choose to merge losses and draws, leaving wins as the singular class (the result we will be betting on).
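The merge itself is a one-line relabeling. The class labels below ('W', 'D', 'L') are assumed for illustration.

```python
import pandas as pd

# Three-class target; labels assumed to be 'W' / 'D' / 'L'
y = pd.Series(["W", "D", "L", "W", "D"])

# Collapse draws and losses into a single 'L/D' class, leaving
# wins as the singular class we will bet on
y_binary = y.replace({"D": "L/D", "L": "L/D"})
print(y_binary.tolist())  # ['W', 'L/D', 'L/D', 'W', 'L/D']
```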

Using the same procedures detailed above, training and testing a new model leads to the classification report below.

Classification Report for Binary Logistic Regression

As expected, the Loss/Draw category, now useless for betting purposes, has quite high precision and recall.

More importantly to our use case, the precision of the algorithm in predicting wins has gone up to 0.71.

This represents a significant improvement over the 0.60 of the original multinomial model.

This increase in precision is worth the trade-off in lost granularity.

If our model is correct 7 out of 10 times that it predicts a team will win a game, then as bettors, we can bet only on predicted wins and, provided the odds are favorable enough, turn a profit in volume.

Again, recall is unimportant.

When our model predicts a win, we want to be relatively confident that it has predicted correctly.

With a model precision of 0.71 in predicting wins, we can feel cautiously optimistic that we’ll make money when betting on an entire season’s worth of games.

Let’s find out if we can take that confidence straight to the bank!

4. Simulating Bets

Getting the Odds

First, we will need the odds given by bookies for each game in our test set, the 2016/17 Premier League season.

Thankfully, football-data.co.uk has us covered.

They collect and maintain a massive dataset on betting odds from hundreds of different sources for each game in every Premier League season.

This is an incredible resource and is specifically maintained for practicing and informing betting strategies.

Perfect.

Ideally, we would find the best individual odds for each game where our model predicts a win and only place bets with that company.

However, with so many different companies and odds listed in the dataset, this is a massively complex task.

Instead, we will use the odds supplied by the website Bet365.com, one of the world’s leading online gambling companies.

Without going into too many details, we will pull out the odds given for the full time result as supplied by Bet365 in decimal format for every team in the Premier League.

Every time our model predicts a win for a given team, we will simulate a $100 bet on that game (this costs us $110 with the fee required to place a bet).

If our model predicts a Loss/Draw we will abstain from betting.

Profit or Loss

To calculate profit in the event of a correct prediction and bet, we take the decimal odds given for that particular result, multiply by our stake of $100, and subtract the original stake plus betting fee for pure profit.

For example, if the given odds for a home team win are 1.65 and our model correctly predicts the result, we walk away with $165.

$100 of this is the original stake and $10 is the fee we pay no matter what.

Therefore, our profit on this bet is $55.

If we lose this bet (i.e., our model predicts a win for the home team but they don’t win that game), we lose $110.
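That accounting fits in a small helper function; the function name is mine, but the arithmetic is exactly the worked example above.

```python
STAKE, FEE = 100, 10  # $100 stake, $10 flat fee on every bet

def bet_profit(decimal_odds: float, won: bool) -> float:
    """Net profit of a single bet under the accounting described above."""
    if won:
        # payout = odds * stake; subtract stake and fee for pure profit
        return round(decimal_odds * STAKE - STAKE - FEE, 2)
    return -(STAKE + FEE)  # lose the stake plus the fee

print(bet_profit(1.65, won=True))   # 55.0
print(bet_profit(1.65, won=False))  # -110
```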

Under this betting strategy and assessment of profit, let’s simulate bets on all 20 teams in the Premier League for the 2016/17 season.

In total we placed 231 bets of $100.

With the added betting fee of $10 per bet, our financial outlay was $25,410.

No small chunk of change.

How did the model perform? In total, we walked away with $29,380.

This is a pure profit of $3,970.

In the 9 months it takes to complete a Premier League season, our investment generated a return of 15.6%!

See the table below for the breakdown of bets and profit/loss by team.
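The season-level bookkeeping can be sanity-checked in a few lines; the dollar figures are the ones quoted above.

```python
# Reproduce the season totals quoted in the text
n_bets, stake, fee = 231, 100, 10

outlay = n_bets * (stake + fee)   # total money put down, fees included
returned = 29380                  # total payout from winning bets
profit = returned - outlay

print(outlay, profit)  # 25410 3970
```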

Total bets placed and profit/loss for each team over a season of betting on the English Premier League

5. Conclusions (And Profit?)

So our model led to a simulated ROI of 15.6%!

Does this mean we’re going to take our model and go change the world of sports gambling? Probably not.

When predicting soccer results, there is simply too much unpredictability.

Perhaps the 2016/17 season was one of the most easily predicted on record? Or perhaps, as more television revenue flows into the sport, past results will become less useful in predicting future games.

However, with enough capital investment it does seem that with this betting strategy and model there is some success to be had.

If nothing else, this has offered a fascinating case study into using domain knowledge to construct a relatively straightforward and interpretable model that can perform strongly with a narrowly defined use case.

If you have any questions or comments, feel free to send me a message; I’d love to hear them! One last time, check the GitHub if you’re interested in perusing or reusing any of my code.

Thanks for reading!