“I don’t Like Cricket…I love it”
Web Scraping Meets The Tidyverse
Rashid Kazmi
Jun 15

Introduction
Cricket is a bat-and-ball game played between two teams of eleven players on a field at the center of which is a 20-meter (22-yard) pitch with a wicket at each end, each comprising two bails balanced on three stumps.
The batting side scores runs by striking the ball bowled at the wicket with the bat, while the bowling and fielding side tries to prevent this and dismiss each player (so they are “out”) (Wikipedia).

Motivation
Cricket’s quadrennial showpiece, the ICC Cricket World Cup 2019, began in the ‘home’ of cricket, England, earlier this month.
There was a lot of hype about the latest edition of the World Cup before it even began.
ESPNcricinfo has become a popular website where cricket fans can access data on all cricket games and players.
The motivation behind this blog is to show how to scrape useful information off a website and generate some basic insights from it with the help of R.
In this first blog I present an extensive exploration of the “International cricket results from 1971 to 2019” data set.
This data set lists all the international cricket matches (as in games between opposing countries or nations) from 1971 to early 2019.
More specifically, this blog will cover the following:
We’ll first learn how you can scrape ESPNCricinfo.com to gather different teams’ records to date.
Then, we’ll see some basic techniques to extract information off of one page: we’ll extract playing teams, winner, margin, ground, match date and scoreboard for all one-day international (ODI) team records by year.
And the individual team scores of all the ODIs on a subpage for all matches by year.
With these tools at hand, you’re ready to step up your game and compare the matches of different cricket teams (of our own choice): we’ll see how you can make use of tidyverse packages such as ggplot2 for plotting and dplyr, in combination with stringr, to inspect the data further and to formulate a hypothesis for further investigation and statistical inference that follows the philosophy of the tidyverse.
Web Scraping ESPNCricinfo.com : rvest
Step 1: Preparations
Step 2: Scrape The Team Records
Step 3: Extract Scorecard URLs
Step 4: Scrape The Scorecards

Step 1: Preparations
To begin with, I made a vector of the years I want to scrape.
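As a minimal sketch of this preparation step, the year vector can be turned into a small data frame of year and URL pairs. The URL pattern below is an assumption about how the yearly ODI results pages are addressed and should be checked against the live site; the name `year_urls` is only illustrative.

```r
library(tibble)

# Years to scrape (the data set covers 1971 to 2019).
years <- 1971:2019

# Assumed URL pattern for the yearly ODI match-results page;
# verify it against the live site before relying on it.
base_url <- "https://stats.espncricinfo.com/ci/engine/records/team/match_results.html"

# One row per year, with the page to scrape for that year.
year_urls <- tibble(
  year = years,
  url  = paste0(base_url, "?class=2;id=", years, ";type=year")
)
```

Keeping the year next to its URL makes it easy to carry the year along when the scraped tables are later bound together.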

Step 2: Scrape The Team Records
In the next step, I applied the scraping function to the list of URLs I generated earlier.
To do this, I used the map() function from the purrr package to apply the `rvest` functions to the year-url data frame.
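A rough sketch of this step might look as follows. It assumes the results table is the first `<table>` on each page, which may not hold on the real site; the helper names `parse_results` and `scrape_year` are hypothetical, and the HTML fragment is made up so the parsing can be tried without network access.

```r
library(rvest)
library(purrr)

# Parse the first <table> of an already-downloaded page into a tibble.
# Assumption: the results table is the first table on the page.
parse_results <- function(page) {
  page %>%
    html_element("table") %>%
    html_table()
}

# Download one yearly page and parse it (needs network access).
scrape_year <- function(url) {
  parse_results(read_html(url))
}

# Offline demonstration on a made-up HTML fragment.
demo_page <- read_html(
  "<table>
     <tr><th>Team 1</th><th>Team 2</th><th>Winner</th></tr>
     <tr><td>England</td><td>Australia</td><td>Australia</td></tr>
   </table>"
)
demo_results <- parse_results(demo_page)

# With a vector of yearly URLs (hypothetical name `urls`) this becomes:
# results <- map(urls, scrape_year)
```

Separating parsing from downloading keeps the parsing logic testable on saved or hand-written HTML.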

Step 3: Extract Scorecard URLs
Next, I extracted the URLs of the scorecards from the href attribute.
I then took that list of URLs, scraped the data I was looking for, and put it into a data frame after some preprocessing.
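One way to sketch the link extraction is shown below. The filter pattern `/ci/engine/match/` and the base address are assumptions about how scorecard links look on the site, the function name is hypothetical, and the HTML fragment (including the match id) is made up for illustration.

```r
library(rvest)
library(stringr)

# Collect scorecard links from a parsed page.
# Assumption: scorecard links contain "/ci/engine/match/" and are relative.
extract_scorecard_urls <- function(page, base = "https://stats.espncricinfo.com") {
  hrefs <- page %>%
    html_elements("a") %>%
    html_attr("href")
  hrefs <- hrefs[!is.na(hrefs)]                      # drop anchors without an href
  hrefs <- str_subset(hrefs, "/ci/engine/match/")    # keep scorecard links only
  unique(paste0(base, hrefs))
}

# Offline demonstration on a made-up fragment.
demo_page <- read_html(
  '<p>
     <a href="/ci/engine/match/64148.html">ODI # 1</a>
     <a href="/ci/content/ground/56441.html">Melbourne</a>
     <a>no link here</a>
   </p>'
)
scorecard_urls <- extract_scorecard_urls(demo_page)
```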

Step 4: Scrape The Scorecards
I mapped the rvest functions over the scorecard URLs.
Since this was a large number of URLs, I used a progress bar from the progress package.
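The combination of map() and a progress bar could be sketched like this. The wrapper name `scrape_with_progress` is hypothetical, and wrapping the scraper in `possibly()` so that a failed page returns `NULL` instead of stopping the run is an addition of mine, not something the original code is known to do.

```r
library(purrr)
library(progress)

# Apply `scrape_fn` to every URL while ticking a progress bar.
scrape_with_progress <- function(urls, scrape_fn) {
  pb <- progress_bar$new(
    format = "scraping [:bar] :current/:total eta: :eta",
    total  = length(urls)
  )
  # A failed page yields NULL rather than aborting the whole run.
  safe_fn <- possibly(scrape_fn, otherwise = NULL)
  map(urls, function(u) {
    pb$tick()
    safe_fn(u)
  })
}

# Offline demonstration with a stand-in for the real scraping function.
demo <- scrape_with_progress(c("a", "b", "c"), toupper)
```

In a real run, `scrape_fn` would be the function that reads and parses one scorecard page.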

Data Cleaning and Wrangling With Tidyverse
One of the big issues when working with data in any context is cleaning and merging datasets, and this was very much the case for the data I collated from ESPNCricinfo.
There are a myriad of ways in which R can be used for data wrangling, but I relied heavily on the tidyverse.
I used tidyr and dplyr for different data wrangling and reshaping tasks.
Finally, I wrote a convenience function that takes the scoreboard data frame as input.
It extracts the scores of all games and binds them into one tibble.
Then the map function was applied to get the data frame with the needed information for team records.
The resultant data frames were joined after some processing.
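The reshaping and joining described above can be illustrated on toy data. All column names and values here are made up for the sketch and are not the actual scraped columns: the match table is reshaped to one row per team per match, the runs are pulled out of the score string, and the two tables are joined.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Toy team records: one row per match (values are invented).
matches <- tribble(
  ~match_id, ~team_1,    ~team_2,     ~winner,
  1,         "India",    "England",   "India",
  2,         "Pakistan", "Australia", "Australia"
)

# Toy scoreboard: one row per team per match (values are invented).
scores <- tribble(
  ~match_id, ~team,       ~score,
  1,         "England",   "249/8",
  1,         "India",     "250/5",
  2,         "Australia", "307",
  2,         "Pakistan",  "266"
)

# Extract the runs (the digits before any "/") as an integer.
tidy_scores <- scores %>%
  mutate(runs = as.integer(str_extract(score, "^\\d+")))

# One row per team per match, with the result and the runs joined on.
team_records <- matches %>%
  pivot_longer(c(team_1, team_2), names_to = "slot", values_to = "team") %>%
  mutate(result = if_else(team == winner, "won", "lost")) %>%
  left_join(tidy_scores, by = c("match_id", "team")) %>%
  select(match_id, team, result, runs)
```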

Visual Data Exploration With Tidyverse
Data visualisation is a critical tool in the data analysis process.
Visualisation tasks can range from generating fundamental distribution plots to understanding the interplay of complex influential variables in machine learning algorithms.
With the dataset created, I will visualise the distribution of the ODI matches played over the years.
I will use ggplot2 to create the main graphic, along with some plots looking at trends in losing and winning for top-ranked cricket teams.
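A bare-bones version of a matches-per-year plot might look like the following. The `match_date` column name and the dates are invented for the sketch; the real data frame would come from the scraping and wrangling steps.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Toy match dates (invented values).
odi <- tibble(
  match_date = as.Date(c("1971-01-05", "1972-08-24", "1972-08-26", "1973-02-11"))
)

# Count matches per calendar year.
matches_per_year <- odi %>%
  mutate(year = as.integer(format(match_date, "%Y"))) %>%
  count(year, name = "matches")

# Bar chart of the yearly counts.
p <- ggplot(matches_per_year, aes(x = year, y = matches)) +
  geom_col() +
  labs(
    title = "Number of ODI matches per year",
    x = "Year",
    y = "Matches"
  )
```

Printing `p` (or calling `print(p)` in a script) draws the chart.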

Number Of Matches Per Year
The International Cricket Council (ICC) gives points to all the teams based on their performances in different tournaments and bilateral series.
These points are then used for ranking of the teams.
The ranking helps in keeping a healthy competition among the countries to keep fighting for victories.
Based on the ranking, I have taken only the top 10 teams into account for exploration and analysis purposes.

Number Of Matches For Top Teams

World Cup 2019 Playing Teams
The British Empire had been instrumental in spreading cricket overseas, and by the middle of the 19th century it had become well established in Australia, the Caribbean, India, New Zealand, North America and South Africa.
However, I am going to focus only on the teams which have qualified for World Cup 2019; these include Afghanistan, Australia, Bangladesh, England, India, New Zealand, Pakistan, South Africa, Sri Lanka and West Indies.
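Restricting the data to these ten teams is a simple filter. The sketch below uses a toy data frame with an assumed `team` column; the rows are invented and only the team list comes from the text above.

```r
library(dplyr)
library(tibble)

# The ten teams that qualified for World Cup 2019 (from the text above).
world_cup_teams <- c(
  "Afghanistan", "Australia", "Bangladesh", "England", "India",
  "New Zealand", "Pakistan", "South Africa", "Sri Lanka", "West Indies"
)

# Toy per-team match rows (invented values, including a non-qualified team).
team_matches <- tribble(
  ~team,      ~year,
  "India",    2018,
  "India",    2019,
  "England",  2019,
  "Zimbabwe", 2018
)

# Keep only World Cup teams and count their matches.
wc_counts <- team_matches %>%
  filter(team %in% world_cup_teams) %>%
  count(team, name = "matches", sort = TRUE)
```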
ICC Cricket World Cup 2019 Teams Final List (https://dailysportsupdates.com)

Which World Cup Playing Team Has Best Win Ratio
In ODIs, it is India who lead the pack.
West Indies, South Africa, England and Bangladesh have recorded wins in the format and sit pretty in the top 5 of the list.
Every team has now played at least once, but predicting which country is going to win based on the performances we have seen thus far is still far-fetched.
However, both India and England seem good contenders for the title, provided they can keep up their current playing form.
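For reference, a win ratio of this kind is wins divided by matches played, computed per team. The sketch below assumes a data frame with one row per team per match and a `result` column; the teams "A" and "B" and their results are invented, and this is not necessarily how the figures above were produced.

```r
library(dplyr)
library(tibble)

# Toy results: one row per team per match (invented values).
team_results <- tribble(
  ~team, ~result,
  "A",   "won",
  "A",   "won",
  "A",   "lost",
  "B",   "lost",
  "B",   "won"
)

# Wins divided by matches played, highest ratio first.
win_ratio <- team_results %>%
  group_by(team) %>%
  summarise(
    matches   = n(),
    wins      = sum(result == "won"),
    win_ratio = wins / matches
  ) %>%
  arrange(desc(win_ratio))
```

Matches with no result or a tie would need an explicit decision on whether they count in the denominator.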

Conclusions:
We identified how web scraping can be split into different phases, each with its own challenges to be tackled: the site analysis phase, the data analysis and design phase, and the production phase.
In each of these phases we mentioned a number of activities to be carried out and questions to be answered before going on to the next phase.
In this blog we have seen how the rvest package of R for web scraping can be applied in the area of statistics.
We have shown how web scraping can be used to explore background variables and to retrieve metadata, and how well it combines with the tidyverse packages of R.
We have looked at the best ODI cricket teams, which are the ones with the highest win ratios.
We have provided plausible insight for predicting the likely winner of the ICC Cricket World Cup 2019, which I am going to take up in the next blog.

References:
“I don’t Like Cricket…I love it!”: 10cc, “Dreadlock Holiday”
“Cricket”: Wikipedia
International Cricket Council (ICC)
“Statistics and records”: ESPNcricinfo.com
“ggplot2”: H Wickham, elegant graphics for data analysis
“Rvest”: H Wickham, easily harvest (scrape) web pages
“The tidyverse”: H Wickham, R package
“gist-syntax-themes”: https://github.