Pandas for Football Player ValueA Web-Scraping and Pandas Guided TutorialStephen FordhamBlockedUnblockFollowFollowingJun 1In Football, January offers a chance for teams in the League to make changes to their squad.
With this in mind, this tutorial will focus on how to scrape information for the most expensive January Premier League signings ever recorded for each position, in a standard 4–4–2 formation.
Once scraped, the tutorial will show how to transform scraped data in a Pandas DataFrame, and demonstrate how this format of data structure can be useful for initial data analysis.
The Forbes Webpage shown below will be used and is provided here.
This Webpage has all the information of interest.
Notably, all the January signings in all positions, the club the players transferred from, and the club they transferred to.
In addition, the year that transfer took place, and their transfer fees in £’s and $’s, when the players were originally purchased, and their value now in £’s and $’s based on a ‘football player inflation rate’.
To begin, I import the requests library.
I then create a ‘url’ variable, and assign the URL link from the Forbes website to this variable.
This link contains information relating to the most expensive transfers in the January Transfer window.
I use the get method from the requests library to begin extracting my data.
To confirm the connection from my python environment to the Webpage, I print the status_code, which returns 200.
200 is the HTTP status code for “OK”, a successful response.
(Other codes you may be familiar with are 404 Not Found, 403 Forbidden, and 500 Internal Server Error.
)Once the connection is validated, I use the text attribute to retrieve the underlying html source code of the Forbes Webpage.
When I print the html text, the output is quite large in my console.
To help narrow down my area of interest, I use ‘CTRL-F’ whilst in the console, and search for the name of a player I know will lead me to where I would like to extract data from.
This search leads me to the information I require, immediately after the <strong> tag.
From a glance at the source output, I can see Shay Given, Newcastle to Manchester City, his transfer fee in £’s and $’s, when the player was originally purchased, and his value now in £’s and $’s.
What is particularly useful in this scenario, is the regularity in the underlying HTML source code.
For example, each player’s name is embedded the same way within a <strong> tag.
In addition, the Team the player transferred from is always after the ‘nbsp;’ characters, and before the word ‘to’.
This pattern means I can create a for loop to iterate over my code and extract the relevant information.
Initially, I like to narrow down my search area.
The output from htmltext = response.
text returns a giant string, which I can split, using the split string method.
I split on ‘*Fees reported’ and take everything after that string by indexing at 1, and split again on a unique string, and take the string before that by indexing at 0.
This leave me with a shortened string.
To convert the string, ‘secondsplit’ into a list, I split it at a strong tag and take everything after the second element (python index 1) until the end of my list.
This leaves me a list (thirdsplit) with all the information I want to extract inside for each player.
I can now iterate over my list.
To find the original transfer fee for example, I use the split method to split each element in my thirdsplit list, and extract the string after the ‘($’ sign and before the ‘m)’ characters by using substring notation.
I then append each transfer fee to a newly constructed list called transfer cost.
I repeat this same process for get the players, their previous club, year of transfer, new club, and fee as of 2019 when a football player inflation rate is applied.
In some instances, the split method for a particular variable in the for loop does not quite remove all the necessary characters.
When I run the for loop as shown, the prefix, ‘Jos&eacute;’ appears before the player Antonio Reyes.
To create the new players list, I simply add a condition to remove these characters, and generate a new list.
This involved checking for leading characters and replacing them with an empty string as shown in the attached GitHub gist code snippet.
The same validation was applied to the Old_team list.
Once the lists have been assembled and cleaned up I can create a pandas DataFrame.
To do so, I import pandas, and zip my lists together inside the pd.
I assign appropriately labelled column names for the lists so that they are representative of their content, for example the ‘Players’ column refers to the list, ‘Players_refined’.
The datatypes of the transfer_df DataFrame columns, Transfer_Fee, Transfer_Year and Cost_2019 are changed to appropriate numeric types using the .
astype() method, before the DataFrame is sorted based on highest Transfer_Fee using the .
I finally create two additional columns.
One which determines the ratio between the Cost of the player in 2019 (based on player inflation rate) and their tranfers fee at the time of purchase.
The second column refers to the player’s respective position.
An additional step I implement is the re-ordering of the columns.
It reads better if, for example the Player name is followed by the Player’s respective position.
When the transfer_df DataFrame is viewed all the necessary changes have been implemented.
The data has been successfully extracted from the Forbes Webpage and inputted into a Pandas DataFrame!An aspect of the pandas library which I really like is the groupby method.
Here, the data is in a very clear and readable form.
It is clear that in January, Arsenal signed the most expensive left-winger(LW), Antonio Reyes for $23.
Further to this, Chelsea have signed 4 of the 11 most expensive January signings (based on the 4–4–2 formation), 2 of which were in 2011.
This data is easier to interpret for comparison purposes that the original text displayed on the Webpage.
A bar-plot can now be produced to help visualize the relative change in market value, between time of original purchase to the present day.
Both the Transfer_Fee_change_column and the schematic illustrate this point more clearly than the raw text depicted on the Forbes Web-page.
This schematic shows that if Fernando Torres had been purchased in today’s transfer market, he would cost 2.
10 times more than he did in 2011.
However, from our table and schematic we can clearly see that Aymeric Laporte would cost less, 0.
84 times what was paid for him in 2018.
At the extreme end, if Scott Parker was purchased today, he would be 3.
37 times more expensive than his original 2004 transfer fee.
A normalization dotted line has been added at 1.
0 on the y-axis to shown the relative change in ratio.
ConclusionThis introductory example described how we can scrape data and transform it into a pandas DataFrame.
By doing so, the data can easily be manipulated and analysed.
In this case, it is now easy to group, based on club, create a new column which calculates the transfer fee change ratio between original purchase and the present market value.
Further to this, we can create informative plots, like the bar-plot shown in this article.
.. More details