Pandas for Football AnalysisA Merging and Scraping DataFrame Example using Football League DataStephen FordhamBlockedUnblockFollowFollowingMay 23IntroductionThis tutorial will centre on how to merge DataFrames scraped from different online sources.
To begin, I will merge the Premier League Table, available from, Wikipedia, with statistics relating to the average time a team spends leading, level or trailing over the course of a season.
This information will be scraped from a separate source, soccerstats.
The 2018–19 Premier League season has just wrapped up.
Whilst the race for the title, and the relegation battle have made most of the headlines, it would be intriguing to know how many minutes on average a team needs to be leading in a Premier League match throughout the season to secure European Football.
When each Premier League campaign finishes, prize money is distributed amongst the teams depending on their league finish position.
Through scraping data, and tabularising it into a DataFrame, you can clearly see the impact of Tottenham Hotspur’s draw with Everton on the final day of the season which cost them a third-place finish.
ScrapingI import the Pandas module as its alias pd following convention.
Pandas has a built in function, read_html which uses libraries like lxml and BeautifulSoup to automatically parse tables out of html files as DataFrame objects.
By default, the pandas.
read_html function searches for and attempts to parse all tabular data contained within <table> tags.
The result is a list of DataFrame objects.
I use the pandas function pd.
read_html() to scrap the Premier League Table from Wikipedia and the leading_level_trailing table form soccerstats.
For a quick review how to do this and basic clean-up (it’s simple), please see my article; Pandas in the Premier League.
This results in the following two DataFrames depicted below.
To save space, I have only printed their first 5 rows using the .
Merging the two DataFramesThe ‘Premier_league’ Table and the ‘leading_level_trailing’ tables can now be merged.
In this case, to merge, I create a new column that is shared between both DataFrames.
I could have chosen to use the existing team name column, but many of the entries actually have slightly different names and padding.
For example, ‘Manchester City (C)’ is the column value in the premier_league table, but ‘Manchester City’ is listed in the leading_level_trailing table.
As there are many instances of different names, I chose to insert a new column with a simple three letter abbreviation for each team.
This saved renaming each team so the corresponding rows would match between the two DataFrames, but there are many ways to do this.
B: When I created the new column, with the team shorthand, the entries have to marry up to their team, for example the first team in the ‘leading_level_trailing’ table is Arsenal, therefore the first element in my Team_shorthand list is ‘ARS’ and so forth.
I create the same Team_abbreviation in the Premier_league table.
I can now merge the two DataFrames.
I decide a new variable name for the merged Table, and use the pd.
merge function to merge.
I list both tables in parentheses, using the on= parameter to select which column to match on and the how= parameter to keep all the rows in the left, Premier_league table.
After organizing the DataFrame a bit more thoroughly, for example removing duplicated column names (e.
Team_y), I check the datatypes of the columns from the merged table ‘leading_level_trailing_stats’, and convert the columns of type ‘object’ (a python string) into a numeric datatype that we can apply numeric filter conditions on.
To answer the original question, we apply filter criteria to the table.
This tells us, that a team need to be leading on average for 31.
7 minutes to secure European football!Merging the Newly Constructed DataFrame with Scraped Non-Tabular dataA prize money column added to this newly formed ‘Merged_premier_league_table’ would make it even more informative.
We could start to draw relationships between the average minutes leading in a Premier League game, and Prize money received.
However, in this case, when I go to a website, for example RadioTimes, the data I need is not in a table format I can scrape using pd.
This means I have two objectives.
Firstly, to scrape the data, and secondly I need and turn it into a DataFrame that I can merge with the newly formed ‘Merged_premier_league_table’.
I first find the data I need from RadioTimes.
Sourced from RadioTimesWhen I right-click and select inspect I can see the underlying HTML source code.
The information I required is nested within the HTML <li> tags which represents list items.
These li tags sit within a HTML <ol> Tag.
To begin scraping this data, I import the requests module, and place the URL of interest in parentheses within the requests.
I next import BeautifulSoup to parse this HTML code, and set the parser parameter to ‘html.
I then use the soup object’s findall method to find all occurrences of ‘ol’.
This returns a results set object which acts like a python list with a list length of 2.
I check both lists to find where my information resides.
I index the second list under the variable name prize_money_scrape.
To see the contents of this list, I use the contents attribute.
Whilst this list gives me the information I require, each list item is separated by a newline character.
As this follows a regular pattern, I can use index notation to provide a step to skip out these newline characters.
I start my new earnings list at element 2 (python’s list 1), and increment with a step size of 2, [1::2].
Now that the newline characters have been eliminated from the list, I need to extract the data from within the <li> tags.
To do this, I create a new empty list called records, iterate over the elements in my earnings list, and extract the text from the result set object, before appending to my new list.
To confirm this are working as intended I print the new records list.
One of the things I really like about Pandas, is that you can turn a list into a DataFrame.
It is very simple to do so.
I can now turn my records list into a DataFrame by passing it to the pd.
I create a new column title within curly brackets in quotations, separate by a semi-colon, then pass in my list, before closing the curly bracket.
When I visualise my one-column DataFrame, I can see it has worked as intended.
The new DataFrame has one column, will all the information dumped inside.
What would be ideal, is if the ‘Earnings’ and ‘Team’ were in separate columns.
To do this, I split the string on the common hypen (-), and use the get method to extract the first element for Earnings and the second element for Team as shown below.
To merge, with the existing Premier League table, I create a new common column that will be shared between the newly scraped Premier_league_prize_money DataFrame, and the ‘Merged_premier_league_table.
I also remove the prefix ‘£’ sign and the suffix ‘m’ character to convert the ‘Earnings’ columns into a numeric datatype.
To finally clean things up I change the name of the Team_x column to team, and change the ‘Earnings’ column to ‘Prize_money_millions’ so it is explicitly clear what the numeric value in that column represents.
To finish, it would be useful to construct a relationship between the continuous variables, ‘Prize_money_millions’ and ‘Average_Mins_leading’.
It would be reasonable to expect a strong correlation between average minutes leading in a Premier League match and prize money achieved, but to be unambiguously clear, I can draw a scatter plot with a regression line using the regplot function from the seaborne library using the code snippet shown below (here, make sure seaborn and matplotlib are install via the pip command and imported into your python script).
To determine the correlation, I use the DataFrame.
I have set the optional method parameter to ‘pearson’, but kendall’, and ‘spearman’ can also be supplied as optional arguments to override the default, ‘pearson’.
As expected the correlation is strong between these two variables.
As can be reasonably expected, more time you spend leading, equates to higher Prize money received.
Finally, to conclude, lets see the impact that the final day draw had on Tottenham’s Prize money.
£2 Million!Football really is a results business!.. More details