The answer lies in switching across to the Developer Tools in Chrome and using the Console tab.
Here, I search for the table tag, and the result is an array with a length of 1.
This means only one table is present on this particular web page.
If I index the first (and only) element, I will be able to scrape the table.
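The scraping step itself can be sketched with pd.read_html, which parses every table tag on a page into a DataFrame. The inline HTML string, team names and figures below are illustrative stand-ins for the BBC page, not the article's actual data:

```python
from io import StringIO

import pandas as pd

# Inline HTML standing in for the BBC article, which contains one table.
# The team names and figures here are purely illustrative.
html = """
<table>
  <tr><th>Team</th><th>Agent fees</th></tr>
  <tr><td>Liverpool</td><td>43.8</td></tr>
  <tr><td>Chelsea</td><td>26.8</td></tr>
</table>
"""

# pd.read_html returns a list of DataFrames, one per <table> tag.
tables = pd.read_html(StringIO(html))
print(len(tables))       # 1 -- matching the console check

spending = tables[0]     # index the first (and only) element
print(spending.shape)    # (2, 2)
```

The length of the returned list plays the same role as the console search: one table on the page, one DataFrame in the list.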
However, now we run into a slight bump in the road, so it is time for Pandas Tip number 2!
Tip 2: Multi-Index to Single Index
When this table is scraped, a multi-index is returned as shown.
In most instances, it is preferable to have a single index.
This will make data sorting and filtering easier.
Now, the highest-level column ‘Premier League Spending 2018–19’ needs to be removed.
To do this, I use the .xs (cross-section) method.
Within this method, the first parameter that I specify is the name of the column I want to get the cross section of.
Here I specify ‘Premier League Spending 2018–19’.
The axis=1 argument takes the cross section along the columns, and drop_level=True returns the cross section without the multi-level index.
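As a sketch of that call (the column labels and values below are assumptions for illustration, not the article's exact ones):

```python
import pandas as pd

# Rebuild a small two-level column index like the one the scrape returns.
# The level names and values are illustrative.
df = pd.DataFrame(
    [["Liverpool", 43.8], ["Chelsea", 26.8]],
    columns=pd.MultiIndex.from_product(
        [["Premier League Spending 2018-19"], ["Team", "Agent fees"]]
    ),
)

# Cross-section on the top-level column label; axis=1 works along the
# columns, and drop_level=True removes that level from the result.
single = df.xs("Premier League Spending 2018-19", axis=1, drop_level=True)
print(single.columns.tolist())  # ['Team', 'Agent fees']
```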
Evidently, there is now a single index, but there are still lots of spaces in the column names, which makes certain tasks, like filtering, more cumbersome.
Ideally, the columns should all be single string elements.
Tip 3: Quick and Easy Column Renaming
The column names are very informative of their contents.
I therefore only want to change them marginally, so downstream data manipulation is easier.
Here, I use the df.columns.str.replace method, and replace all spaces, hyphens and asterisks with an underscore.
These characters need to be removed because they can cause syntax errors when we attempt to filter (later on) for example.
It is therefore best practice to remove/replace them.
As a demonstration of how useful this string method is, I have included the equivalent code, commented out, that would be necessary to achieve the same outcome using the df.rename method. As shown, the string method is a lot simpler.
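A minimal sketch of the one-liner, alongside the verbose rename alternative. The column names below are invented for illustration:

```python
import pandas as pd

# Invented column names containing the problem characters.
df = pd.DataFrame(columns=["Gross spend - £m", "Agent fees*", "Net spend - £m"])

# One line: swap runs of spaces, hyphens and asterisks for underscores,
# then trim any underscore left dangling at the ends.
df.columns = (
    df.columns.str.replace(r"[ \-*]+", "_", regex=True).str.strip("_")
)
print(df.columns.tolist())  # ['Gross_spend_£m', 'Agent_fees', 'Net_spend_£m']

# The verbose equivalent maps every old name to a new one by hand:
# df = df.rename(columns={"Gross spend - £m": "Gross_spend_£m", ...})
```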
Easy, when you know how! I now perform some simple string manipulation functions, as shown in the GitHub gist below, to transform the string entries into numeric datatypes.
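The gist itself is not reproduced here, but the kind of conversion it performs might look like this. The '£<number>m' format and the figures are assumptions for illustration:

```python
import pandas as pd

# Illustrative scraped strings in a '£<number>m' format.
spend = pd.Series(["£43.8m", "£26.8m", "£8.9m"])

# Strip the currency symbol and the trailing 'm', then cast to float.
spend_numeric = spend.str.replace("£", "").str.replace("m", "").astype(float)
print(spend_numeric.tolist())  # [43.8, 26.8, 8.9]
```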
This Table has some interesting data, but what I really want is a comprehensive view of the Premier League Finances for the 2018–19 season.
So, onto Tip number 4.
Tip 4: PDF to Pandas DataFrame
Navigating to the Premier League news article led me to information regarding the value of the Premier League's central payments to clubs during the 2018/19 season.
This page included a link to a downloadable pdf (shown below).
The next few tasks involve scraping this table and merging it with the spending table extracted from the BBC Football web page.
I imported the read_pdf method from tabula, and passed the file path to this method.
I initially assigned the DataFrame the variable name df, and checked its type.
A Pandas DataFrame is returned.
Simple clean-up and renaming followed, but I included this tip because it really demonstrates how simple it can be to transform a PDF table into a Pandas DataFrame amenable to analysis.
Tip 5: Multiple-column String Manipulation
The columns in the league_club_revenue DataFrame are a string datatype, and need to be converted to a numeric datatype.
However, on careful inspection, all the column values, with the exception of ‘Club_name’ and ‘UK_live_Broadcast’, are prefixed by a ‘£’ sign and interspersed with commas.
To change these columns together, I simply create a list of the columns (Pandas Series). I then iterate over these columns in a for loop within the DataFrame, and replace the ‘£’ and ‘,’ with empty strings, before converting each column to a numeric datatype.
I also divided by 1 million because, ultimately, I want to merge the ‘spending’ and ‘league_club_revenue’ DataFrames together, so I need the data to be consistent between these two DataFrames.
I need all columns to express their values in £ millions.
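A sketch of that loop, with made-up revenue figures and hypothetical column names:

```python
import pandas as pd

# Made-up figures mimicking league_club_revenue: money columns are
# strings prefixed with '£' and containing thousands separators.
df = pd.DataFrame({
    "Club_name": ["Liverpool", "Chelsea"],
    "UK_live_Broadcast": [29, 25],
    "Equal_share": ["£34,361,519", "£34,361,519"],
    "Merit_payment": ["£38,370,360", "£36,452,842"],
})

money_cols = ["Equal_share", "Merit_payment"]  # the columns to convert
for col in money_cols:
    # Drop '£' and ',', cast to float, and divide by one million so the
    # values are in £m, consistent with the spending DataFrame.
    df[col] = (
        df[col].str.replace("£", "").str.replace(",", "").astype(float)
        / 1_000_000
    )
print(df[money_cols])
```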
Almost one line of code cleans up the entire DataFrame!
Tip 6: Simple Merging
To merge and successfully align the data, I produce a new column in the league_club_revenue DataFrame.
This is necessary as the teams are in a different order in both tables.
I create the new column, called ‘Team’, with the teams in the order they appear in the league_club_revenue DataFrame.
I can then merge on this common column shared between the two DataFrames, and the rows across each DataFrame will now align. The new DataFrame, Financial_review, is the merged product.
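The alignment trick can be sketched with two toy frames; the Merit_payment values are invented:

```python
import pandas as pd

# Toy versions of the two tables; note the teams are in different orders.
spending = pd.DataFrame({
    "Team": ["Chelsea", "Liverpool"],
    "Agent_fees": [26.8, 43.8],
})
league_club_revenue = pd.DataFrame({"Merit_payment": [38.4, 36.5]})

# Add a 'Team' column in the order the clubs appear in this table...
league_club_revenue["Team"] = ["Liverpool", "Chelsea"]

# ...then merge on the shared column: rows now line up by team,
# not by their original positions.
Financial_review = spending.merge(league_club_revenue, on="Team")
print(Financial_review)
```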
The data formatting is consistent, with all column values, with the exception of ‘UK_live_broadcast’ and ‘Current_Position’, expressed in £ millions (sterling).
Tip 7: Filtering
The newly merged DataFrame is now amenable to querying.
To start with a basic query, I specify 2 conditions.
I want to know the teams who spent more than £8 million on agent fees and finished lower than 10th position (as of 4 April, when the BBC article on agent and estimated fees was written).
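A sketch of that two-condition filter; the rows and figures are invented:

```python
import pandas as pd

# Invented rows: agent fees in £m and league position as of 4 April.
Financial_review = pd.DataFrame({
    "Team": ["Everton", "Fulham", "Liverpool"],
    "Agent_fees": [9.7, 8.9, 43.8],
    "Current_Position": [9, 19, 2],
})

# Each condition sits in parentheses and they are combined with &.
result = Financial_review[
    (Financial_review.Agent_fees > 8)
    & (Financial_review.Current_Position > 10)
]
print(result.Team.tolist())  # ['Fulham']
```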
The results may suggest that the agents for these teams could do a better job, as the teams they scout players for sit in the bottom half of the table! Now suppose we want to query another condition.
This time, however, we do not want the entire record (row) returned.
To achieve this, simply write the filter condition followed by a period and then the column you want to return.
Here, I only want to know the names of the teams who have been estimated to spend more than £60 million on transfers and received a merit payment of less than £20 million. It looks like Bournemouth, Brighton and Fulham spend big, but do not recoup that investment very well in merit-based prize money.
It's not looking good so far, financially, for these teams compared to the rest! For comparison purposes, I have included what the returned result would look like if I did not use dot notation followed by the column, ‘Team’.
Here, the entire record or row is returned.
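The contrast between the two forms can be sketched as follows; the figures are invented:

```python
import pandas as pd

# Invented figures, both money columns in £m.
Financial_review = pd.DataFrame({
    "Team": ["Bournemouth", "Liverpool"],
    "Transfer_spend": [71.4, 182.0],
    "Merit_payment": [7.6, 38.4],
})

cond = (Financial_review.Transfer_spend > 60) & (
    Financial_review.Merit_payment < 20
)

# Filter followed by a period and the column: only that Series returns.
print(Financial_review[cond].Team.tolist())  # ['Bournemouth']

# Filter alone: the entire record (row) comes back as a DataFrame.
print(Financial_review[cond])
```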
Finally, I will conclude by demonstrating how a simple aggregate function can be used. Firstly, the mean agent spending (in £ million) for all the teams in the Premier League is calculated. Two different ways to achieve this result are shown: the first with square brackets, and the second using dot notation (more commonly used), which is why illegal characters such as spaces needed to be removed and replaced with underscores earlier.
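Both spellings of the same aggregate, with invented values:

```python
import pandas as pd

# Invented agent fees in £m.
Financial_review = pd.DataFrame({"Agent_fees": [43.8, 26.8, 8.9]})

# Square brackets work with any column name...
print(Financial_review["Agent_fees"].mean())

# ...while dot notation needs a valid Python identifier, which is why
# spaces were swapped for underscores earlier.
print(Financial_review.Agent_fees.mean())
```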
Let's suppose you want to find out the mean number of UK broadcast games for the top-half and bottom-half teams of the Premier League separately. To gain further insight, it would be interesting to determine the difference between the average number of broadcast games for teams in the top half of the table versus the bottom half.
Simply use .loc, then select the first 10 rows using [:10, followed by a comma, and finally the column name in quotations, followed by a closing square bracket. Repeat this for the second half of the table with [10:, and we can clearly see that teams in the top half have on average 8.49 more UK broadcast games compared with teams that reside in the bottom half.
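A sketch of the split, using 20 invented broadcast counts in table order. One caveat worth knowing: .loc slices by label and includes the stop value, so with a default integer index .iloc gives an exact 10/10 split:

```python
import pandas as pd

# 20 invented UK broadcast counts, ordered top of the table to bottom.
Financial_review = pd.DataFrame({
    "UK_live_broadcast": [29, 27, 26, 25, 25, 22, 20, 19, 18, 17,
                          16, 15, 14, 14, 13, 12, 12, 11, 10, 10],
})

# .iloc slices by position and excludes the stop, so [:10] and [10:]
# split the table cleanly into top and bottom halves.
top = Financial_review.iloc[:10]["UK_live_broadcast"].mean()
bottom = Financial_review.iloc[10:]["UK_live_broadcast"].mean()
print(top - bottom)
```

With these invented numbers the gap is about 10 games; the article's real data produced the 8.49 figure quoted above.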
I hope this example has demonstrated some useful Pandas features to make data handling and management that little bit easier.
If you liked the Premier League example, I have written an introductory article entitled ‘Pandas in the Premier League’ which shows how Pandas can help with initial data clean up.