Now we will use the urlopen function that we imported from the urllib.request module, then create a BeautifulSoup object by passing the HTML through to BeautifulSoup().
```python
# NBA season we will be analyzing
year = 2019
# URL page we will be scraping (see image above)
url = "https://www.…".format(year)
# this is the HTML from the given URL
html = urlopen(url)
soup = BeautifulSoup(html)
```

Here, we passed the entire web page through BeautifulSoup() in order to convert it into an object.
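If you want to try the parsing step without hitting a live page, BeautifulSoup will happily parse an inline HTML string instead of the result of urlopen(). The snippet below is a minimal sketch of that; the table contents are invented for illustration:

```python
from bs4 import BeautifulSoup

# a tiny invented stand-in for the HTML a stats page might return
html = """
<html><body>
  <table>
    <tr><th>Rk</th><th>Player</th><th>PTS</th></tr>
    <tr><th>1</th><td>Player A</td><td>27.1</td></tr>
  </table>
</body></html>
"""

# BeautifulSoup converts the raw HTML into a navigable object
soup = BeautifulSoup(html, "html.parser")
print(soup.find("th").getText())  # first table header cell
```

Passing an explicit parser such as "html.parser" also silences the warning BeautifulSoup emits when no parser is named.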
Now we will trim the object to only include the table we need.
The next step is to organize the column headers.
We want to extract the text content of each column header and store them into a list.
By inspecting the HTML (right-click the page and select “Inspect Element”), we can see that the first table row contains the column headers we want.
Looking at the table, we can see the specific HTML tags we will be using to extract the data. Now, we go back to BeautifulSoup.
By using findAll(), we can get the first 2 rows (limit = 2) and pass the element we want as the first argument, in this case ‘tr’, which is the HTML tag for table row.
After using findAll(), we use getText() to extract the table header (‘th’) text we need and organize it into a list:

```python
# use findAll() to get the column headers
soup.findAll('tr', limit=2)
# use getText() to extract the text we need into a list
headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
# exclude the first column, as we will not need the ranking order from Basketball Reference for the analysis
headers = headers[1:]
headers
```

Next, we will extract the data from the table cells so we can add it to our DataFrame.
Although this is similar to extracting the column headers, the data within the cells, in this case player stats, is in a two-dimensional format.
Therefore, we must build a two-dimensional list:

```python
# avoid the first header row
rows = soup.findAll('tr')[1:]
player_stats = [[td.getText() for td in rows[i].findAll('td')]
                for i in range(len(rows))]
```

Now comes something a bit more familiar: pandas.
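The row-and-cell extraction can be checked end to end on a tiny invented table, with no network access; the HTML and player values here are made up:

```python
from bs4 import BeautifulSoup

# invented stand-in for the stats table
html = """
<table>
  <tr><th>Rk</th><th>Player</th><th>PTS</th></tr>
  <tr><th>1</th><td>Player A</td><td>27.1</td></tr>
  <tr><th>2</th><td>Player B</td><td>25.6</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# skip the header row, then collect only the 'td' cells of each row
rows = soup.findAll('tr')[1:]
player_stats = [[td.getText() for td in rows[i].findAll('td')]
                for i in range(len(rows))]
print(player_stats)
```

Note that the ranking cell is a ‘th’, not a ‘td’, so it drops out of each row automatically, which is exactly why the headers list also excludes it.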
With the data and column headers we have extracted from Basketball Reference, we can create a simple DataFrame:

```python
stats = pd.DataFrame(player_stats, columns=headers)
stats.head(10)
```

Voilà! We have our own data set! In a matter of minutes, we parsed an HTML document, extracted the data from its table, and organized it into a DataFrame.
For free! And without any hassle of searching for a .csv file that might not be up to date. We have data that comes directly from a source that is updated every day.
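The DataFrame step itself can be reproduced without any scraping; the headers and stats below are invented stand-ins shaped like the lists we built above:

```python
import pandas as pd

# invented stand-ins for the scraped headers and 2-D stats list
headers = ["Player", "PTS", "AST"]
player_stats = [
    ["Player A", "27.1", "8.0"],
    ["Player B", "25.6", "5.2"],
]

# each inner list becomes one row; headers become the column labels
stats = pd.DataFrame(player_stats, columns=headers)
print(stats.head())
```

The only requirement is that every inner list has the same length as the headers list, which the table structure guarantees.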
Of course, not every HTML document is created equal.
Some pages require a bit more time spent analyzing the HTML tags in order to extract the proper data.
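As one example of that extra work: getText() returns strings, and tables that repeat their header row partway down can yield empty lists, so a little cleanup is often needed before analysis. The data below is invented:

```python
import pandas as pd

# invented scraped output: an empty list from a repeated header row,
# and every value still a string
player_stats = [["Player A", "27.1"], [], ["Player B", "25.6"]]

# drop the empty rows produced by repeated header rows
player_stats = [row for row in player_stats if row]

stats = pd.DataFrame(player_stats, columns=["Player", "PTS"])
# convert the numeric column from strings to numbers
stats["PTS"] = pd.to_numeric(stats["PTS"])
print(stats["PTS"].mean())
```

Without the to_numeric() step, operations like mean() would be working on text rather than numbers.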
To conclude this quick walkthrough: web scraping can be a simple task, but certain HTML documents can make it difficult.
There are a bunch of helpful resources out there that will help you understand HTML tags, and get the data you need.
There might be a problem you want to dive into, so don’t let limited data be an issue!

Helpful Resources:
HTML Tags: Source 1 | Source 2
BeautifulSoup: Source 1 | Source 2