Hypothesis Testing European Soccer Data Using Python

Will our hypothesis tests yield valuable, statistically significant conclusions, or simply leave us with more unanswered questions? We used a European Soccer Database from Kaggle.com to explore these hypotheses: a sqlite3 database with 7 data tables covering over 25,000 historical soccer matches, 10,000 players, and teams (player & team ratings assessed by EA Sports) from 11 European countries, spanning 2008–2016.

To summarize our approach in terms of individual statistical hypotheses, we're running four 2-tailed, 2-sample t-tests, each with a threshold of alpha = 0.05 for rejecting or failing to reject the null hypothesis.

In order to run these tests, our data must be sampled at random, approximately normally distributed, and independent.
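Each of the four tests ultimately reduces to a single SciPy call. Below is a minimal, self-contained sketch on simulated binary win columns; the win probabilities and the choice of Welch's variant (`equal_var=False`) are illustrative assumptions, not our actual setup:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# hypothetical binary win columns: home teams winning ~46% of matches, away ~29%
home_wins = rng.binomial(1, 0.46, size=1000)
away_wins = rng.binomial(1, 0.29, size=1000)

# 2-tailed 2-sample t-test; Welch's variant does not assume equal variances
t_stat, p_value = stats.ttest_ind(home_wins, away_wins, equal_var=False)
reject_h0 = p_value < 0.05  # compare against alpha = 0.05
```

With a difference this large and 1,000 simulated matches per group, the test comfortably rejects the null.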

Our tests yielded a variety of descriptive result metrics and helpful visualizations which can be viewed in more detail in our GitHub repositories linked at the end of this article, but I will stick to the most vital points below for the sake of brevity.

Test 1: Home Team vs. Away Team Win Rates

H0 (Null Hypothesis): there is no statistically significant difference between the avg. win rate for home teams and the avg. win rate for away teams
HA (Alternative Hypothesis): there exists a statistically significant difference between the avg. win rate for home teams and the avg. win rate for away teams

Is there any truth to home-field advantage?

Test 2: 4–4–2 vs. 4–3–3 Win Rates

H0: there is no statistically significant difference between the 4–4–2 win rate and the 4–3–3 win rate
HA: there exists a statistically significant difference between the 4–4–2 win rate and the 4–3–3 win rate

[Figure: Blue = 4–4–2 Formation | Red = 4–3–3 Formation]

Test 3: Defensive Aggression Rating (English Premier League vs. French Ligue 1)

H0: there is no statistically significant difference between the avg. defensive aggression rating of English teams and the avg. defensive aggression rating of French teams
HA: there exists a statistically significant difference between the avg. defensive aggression rating of English teams and the avg. defensive aggression rating of French teams

Test 4: Shooting Chance Creation Rating (English Premier League vs. French Ligue 1)

H0: there is no statistically significant difference between the avg. shooting chance creation rating of English teams and the avg. shooting chance creation rating of French teams
HA: there exists a statistically significant difference between the avg. shooting chance creation rating of English teams and the avg. shooting chance creation rating of French teams

Overview of Process & Challenges

SQLite → PostgreSQL → Pandas DataFrames

Kaggle provides the 7 data tables in the SQLite database format. To improve team collaboration, we chose to move the data to a PostgreSQL database by instructing pgloader to load data from a SQLite file, which worked almost perfectly.
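For reference, the pgloader migration is a one-line command of this shape (the file path and database name here are illustrative):

```shell
# migrate all tables from the Kaggle SQLite file into a local PostgreSQL database
pgloader ./database.sqlite postgresql:///soccer
```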

The Player table presented an error, so we worked around it by first transforming it to a csv and loading it directly into a DataFrame with the following code:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect('database.sqlite')
df = pd.read_sql_query("SELECT * FROM Player", conn)
df.to_csv('player.csv', index=False)

player_df = pd.read_csv('player.csv')
player_df.head()
```

For tables that made it cleanly into PostgreSQL, we loaded them into Pandas DataFrames in the following way:

```python
import psycopg2
import pandas as pd

conn = psycopg2.connect('dbname=soccer')
cur = conn.cursor()

query = '''
SELECT *
FROM Country;
'''
cur.execute(query)
countries_data = cur.fetchall()

countries_df = pd.DataFrame(countries_data)
countries_df.columns = [i[0] for i in cur.description]
countries_df.head()
```

Home Goals & Away Goals columns → Binary Home Win & Away Win columns → Home Win Rates & Away Win Rates

We used our knowledge of how many goals each team scored in each match to create 2 binary columns, where 1's represent the home or away team winning and 0's represent "not wins", i.e. the team drawing or losing.

Our win rate is essentially capturing wins over total matches, which is solely based on the chance of a particular team achieving a win given a played match, and does not penalize a loss and a tie differently (traditionally, leagues award 3 points for a win, 1 point for a draw, and 0 for a loss).

Win rates, as described, would be computed by taking the mean of these binary columns.

```python
import numpy as np

# initiate new columns with 0's (there are other approaches)
match_df['home_team_win'] = np.zeros(len(match_df))
match_df['away_team_win'] = np.zeros(len(match_df))

# set home team WINs equal to 1 in the new column
match_df.loc[match_df['home_team_goal'] > match_df['away_team_goal'], 'home_team_win'] = 1
# LOSS = 0
match_df.loc[match_df['home_team_goal'] < match_df['away_team_goal'], 'home_team_win'] = 0
# TIE = 0
match_df.loc[match_df['home_team_goal'] == match_df['away_team_goal'], 'home_team_win'] = 0
# repeat for away_team_win column

# getting to a win rate for the entire dataset
home_team_win_array = np.array(match_df['home_team_win'])
home_win_rate = np.mean(home_team_win_array)
# repeat for away_team_win column
```

Reckoning test power, effect size, and sample size

To arrive at our ideal sample size in each test, we calculated the effect size (Cohen's d), which takes into account the difference between averages and the pooled variances of the sample data.

We fed this into a calculation of the minimum sample size needed to achieve a desired alpha level (0.05) and a desired power level (typically around 0.8).
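These two calculations can be sketched as follows; the pooled-SD form of Cohen's d is standard, and the sample-size formula is the usual normal approximation for a 2-tailed 2-sample t-test (the exact numbers below are illustrative, not our data):

```python
import numpy as np
from scipy.stats import norm

def cohens_d(a, b):
    # effect size: difference in means over the pooled standard deviation
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * np.var(a, ddof=1) +
                         (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled_sd

def min_n_per_group(d, alpha=0.05, power=0.8):
    # normal-approximation minimum sample size per group for a 2-tailed 2-sample t-test
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return int(np.ceil(2 * ((z_alpha + z_power) / d) ** 2))

min_n_per_group(0.5)  # a "medium" effect needs roughly 63 observations per group
```

The smaller the effect size, the more samples the test demands, which is exactly the trap our tests 2–4 fell into.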

In sum, these decisions help balance the tradeoff between a test's risk of a Type I error (rejecting a true H0, a false positive) and a Type II error (failing to reject a false H0, a false negative).

To our surprise, due to a rather slim difference between sample means, the calculations indicated that we would actually need far more samples than we had available. Because of this, the power of our test is significantly reduced in hypothesis tests 2, 3, and 4, increasing the risk of a Type II error. We moved forward in running the tests despite this, but an ideal scenario would allow us to achieve a larger statistical power before concluding the tests with confidence.
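The same normal approximation can be run in reverse to estimate the power actually achieved with the samples on hand; this sketch uses illustrative numbers, not our actual effect sizes:

```python
import numpy as np
from scipy.stats import norm

def achieved_power(d, n, alpha=0.05):
    # approximate power of a 2-tailed 2-sample t-test with n observations per group
    # (ignores the negligible far-tail term)
    z_alpha = norm.ppf(1 - alpha / 2)
    return float(norm.cdf(d * np.sqrt(n / 2) - z_alpha))

# a small effect with a modest sample leaves substantial Type II risk
achieved_power(0.2, n=200)
```

A result well below 0.8 is the quantitative version of "our test lacks power."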

The case for bootstrapping

One goal of inferential statistics is to determine the value of a parameter of a population, which is often expensive or impossible to measure directly.

Statistical sampling helps us overcome this challenge.

We sample a population, measure a statistic about it, and use this to hopefully say something meaningful about the corresponding population.

In the case of our first hypothesis test, we want to estimate the mean win rate of home teams.

It's not easy to collect results from every European soccer game ever played, so we sample 25K matches from 2008–2016 and say that the population's mean win rate falls within a margin of error of our sample's mean win rate.
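That margin of error for a proportion like a win rate can be sketched with the usual normal approximation (the win rate and sample sizes here are illustrative):

```python
import numpy as np
from scipy.stats import norm

def margin_of_error(p_hat, n, confidence=0.95):
    # half-width of a normal-approximation confidence interval for a proportion
    z = norm.ppf(1 - (1 - confidence) / 2)
    return z * np.sqrt(p_hat * (1 - p_hat) / n)

# with ~25,000 matches and a win rate near 0.46, the margin is well under 1%
margin_of_error(0.46, 25000)
```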

Suppose we want to know with greater accuracy what the mean home team win rate is for that period, but all we have to utilize are those samples.

It seems that the margin of error from before is likely indicative of our best guess.

However, we can use bootstrapping to improve that guess.

To do this, we randomly sample with replacement from the 25K known matches.

We call this a bootstrap sample.

With replacement, this bootstrap sample is most likely not identical to our initial sample.

Some matches may appear in our sample more than once, and others may be omitted.

Using Python, we can create thousands of iterations of these bootstrap samples quickly.

The bootstrap method yields a distribution of sample estimates that is often approximately normal, which we can summarize with measures of central tendency and variance.

Loosely based on the law of large numbers, if we sample over and over again we can arrive at a sort of “mini population”.

This applies even when you’re using a single sample to generate the data, thanks to bootstrapping and with the help of fast computers.
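As a quick illustration of the idea, the sketch below bootstraps a 95% percentile interval for a win rate from simulated binary outcomes (the underlying array and its 0.46 win probability are stand-ins, not our actual data):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical stand-in for a binary win column (1 = win), true rate 0.46
wins = rng.binomial(1, 0.46, size=2000)

# resample with replacement many times, recording each resample's mean
boot_means = np.array([
    rng.choice(wins, size=wins.size, replace=True).mean()
    for _ in range(1000)
])

# 95% percentile interval from the bootstrap distribution
lo, hi = np.percentile(boot_means, [2.5, 97.5])
```

The interval brackets the sample's own mean win rate, and its width shrinks as the original sample grows.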

```python
# example of bootstrapping code applied to our home team win binary column
sample_means_home = []
for _ in range(1000):  # number of iterations
    sample_mean = np.random.choice(home_team_win_array, size=202).mean()  # sample size: 202
    sample_means_home.append(sample_mean)

len(sample_means_home)  # should be 1000
```

Conclusions

Test 1: Home Team vs. Away Team Win Rates

Conclusion: Reject H0 in favor of the HA. There is a significant difference in win rate for home teams vs. away teams. Home-field advantage definitely exists! Below is a simple visualization we created aside from our hypothesis test to illustrate this by team and for the overall data:

[Figure: Most teams tend to win more on their home field]

Test 2: 4–4–2 vs. 4–3–3 Win Rates

Conclusion: Reject H0 in favor of the HA. There is a significant difference in win rate for 4–4–2 and 4–3–3. Based on our data, 4–4–2 is the better formation in terms of win rate, though we would be able to run a stronger test with more samples and/or a larger difference between the formations' avg. win rates.

[Figure: 4–4–2 is better on average, but our low power metric (0.67) indicates Type II risk]

Test 3: Defensive Aggression Rating (English Premier League vs. French Ligue 1)

Conclusion: Reject the H0, with caveats. There seems to be a difference in the avg. defensive aggression rating (EA Sports) of English teams over French teams, but our test lacks power (risking Type II error). We can't confidently say that there is a true difference in avg. defensive aggression between leagues. We would be able to run a stronger test with more samples and/or a larger difference between league averages.

Test 4: Shooting Chance Creation Rating (English Premier League vs. French Ligue 1)

Conclusion: Reject the H0, with caveats. There seems to be a difference in the avg. shooting chance creation rating (EA Sports) of English teams over French teams, but our test lacks power (risking Type II error). We can't confidently say that there is a true difference in the avg. shooting chance creation aptitude between leagues. We would be able to run a stronger test with more samples and/or a larger difference between league averages.

Thank you for reading our brief, hopefully digestible overview of our European Soccer hypothesis tests!

To dive into the details of our data, code, and statistical tests, explore our GitHub repositories: Connor — Kevin — Alex

Connect with us on LinkedIn to share your thoughts and questions, or to follow our data science journeys: Connor — Kevin — Alex
