Forget APIs: Do Python Scraping Using Beautiful Soup, Import Data Files from the Web: Part 2

Sahil Dhankhar · Mar 31

APIs are not always there for you on every website, but Beautiful Soup will always let you collect almost any kind of data from almost any website.

Source: gurutechnolabs.com

Today, data plays a critical role in every industry.

And most of this data comes from the internet.

Many companies invest millions of dollars in technology to gain users long before that investment returns a profit.

The internet is so vast that it contains more information on any single topic than your nerdy professor does. And the need to extract information from the web is becoming increasingly loud and clear.

Most of the time, when we add information on Facebook, Twitter, or LinkedIn, or give feedback on Yelp, that information is treated as data.

This kind of data from the internet comes in many different forms: comments, restaurant feedback on Yelp, Twitter discussions, Reddit user discussions, stock prices, and so on.

You can collect all this data, organize it and analyze it.

That's what we are going to talk about in this tutorial.

 There are several ways of extracting or importing data from the Internet.

You can use APIs to retrieve information from many major websites.

That's what everybody uses these days to import data from the internet: all the primary sites like Twitter, Twitch, Instagram, and Facebook provide APIs to access their datasets.

And all this data is available in a structured form.

But most websites don't provide an API.

I think they either don't want us to use their users' data, or they simply don't have the resources to offer one.

So, in this tutorial, we are going to import data from the web without using any APIs.

But before we proceed, please have a look at Part 1 of this series, because everything connects like dots.

Something You Don't Know about Data Files if You're Just a Starter in Data Science (towardsdatascience.com)

What is Beautiful Soup

Don't write that awful page (Source: crummy.com)

Beautiful Soup is the best library for scraping data from a particular website, or from the internet in general.

It is also very comfortable to work with.

It parses and extracts structured data from HTML.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.

You don't have to think about encodings unless the document doesn't specify one and Beautiful Soup can't detect it.

In that case, you just have to specify the original encoding.
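As a small sketch of that rare case, you can pass the original encoding yourself via the `from_encoding` argument; the Latin-1 byte string below is a contrived example, not something from the tutorial's data:

```python
from bs4 import BeautifulSoup

# A document whose bytes are Latin-1 encoded, with no charset declaration
raw = "<html><body><p>café</p></body></html>".encode("latin-1")

# Tell Beautiful Soup the original encoding explicitly
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
print(soup.p.string)         # incoming text is now proper Unicode
print(soup.encode("utf-8"))  # outgoing document is UTF-8 bytes
```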

Rules:

To run your program, please use a Jupyter Python environment and run it cell by cell instead of running the whole program at once. This is just a precaution, so your program doesn't hammer the website.

Please check the website's terms and conditions before you start pulling data from it, and be sure to read its statement about the legal use of data.

Basics: Getting Familiar with HTML

HTML code plays an essential role in extracting data from a website. So, before we proceed, let us jump to the basics of HTML tags. If you already have a bit of knowledge of HTML tags, you can skip ahead to the next section.

```html
<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
    <h1>Learning about Data</h1>
    <p>Beautiful Soup</p>
  </body>
</html>
```

This is the basic syntax of an HTML webpage. Every <tag> serves a block inside the webpage:

1. <!DOCTYPE html>: HTML documents must start with a type declaration.
2. The HTML document is contained between <html> and </html>.
3. The meta and script declarations of the HTML document sit between <head> and </head>.
4. The visible portion of the HTML document is between the <body> and </body> tags.
5. Title headings are defined with the <h1> through <h6> tags.
6. Paragraphs are defined with the <p> tag.

Other useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for table cells.
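To connect these tags to what comes next, here is a minimal sketch (using Python's built-in html.parser) that parses the tiny example page above and pulls out the heading and paragraph text:

```python
from bs4 import BeautifulSoup

# The small example page from above, as a Python string
html_doc = """
<!DOCTYPE html>
<html>
  <head></head>
  <body>
    <h1>Learning about Data</h1>
    <p>Beautiful Soup</p>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.h1.string)  # Learning about Data
print(soup.p.string)   # Beautiful Soup
```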

Let's Check Your HTML Page

List of Asian countries by area – Wikipedia (en.wikipedia.org)

Let us take this Wikipedia page to do the scraping. If you have Google Chrome, go to the page, right-click on it, and open your browser's inspector to inspect the webpage.

Inspect Wikipedia Page

From the result, you can see the table sits inside the "wikitable sortable" class, and if you inspect it further, you can find all of your table information there. Fantastic, yeah! It's going to be even more amazing to see what you can do with Beautiful Soup.

Wikitable Sortable

Let's Start Your DIY Project

Now we know about our data and where it is located, so we are going to start scraping it. Before we proceed, you need to install and import some libraries.

```python
# Import libraries
from bs4 import BeautifulSoup
import requests
```

If you face any trouble during installation, you can put sudo in front of the install command.

Requests

Requests is meant to be used by humans to communicate with the web. This means you don't have to manually join query strings to URLs or form-encode your POST data. Requests enables you to send HTTP/1.1 requests using Python. With it, you can include content like headers, form data, multipart files, and parameters through simple Python dictionaries, and you can access the response data in the same way.
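A small offline sketch of that convenience follows; the endpoint and parameter are just illustrative, and `prepare()` builds the request without sending anything over the network:

```python
import requests

# Build a GET request with query parameters without actually sending it,
# to show how Requests encodes the query string for you
req = requests.Request(
    "GET",
    "https://en.wikipedia.org/w/index.php",
    params={"search": "List of Asian countries by area"},
)
prepared = req.prepare()
print(prepared.url)  # spaces in the parameter are encoded automatically
```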

BS4: BeautifulSoup

Beautiful Soup is a Python library for extracting data out of HTML and XML files. It works with your favourite parser to provide natural ways of navigating, searching, and modifying the parse tree. It usually saves programmers hours or days of work.

```python
# Specify which URL/web page we are going to scrape
url = requests.get('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area').text
```

We begin by fetching the source code for the given web page; we will then build a BeautifulSoup (soup) object with the BeautifulSoup function.

Now, we need the BeautifulSoup function, which will help us parse and work with the HTML we fetched from our Wikipedia page:

```python
# Import the BeautifulSoup library so we can parse HTML and XML documents
from bs4 import BeautifulSoup
```

Then we use Beautiful Soup to parse the HTML data we collected in our 'url' variable, and we assign a different variable, called 'soup', to store the data in Beautiful Soup format.

```python
# Parse the HTML from our URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(url, "lxml")
```

To get an idea of the structure of the underlying HTML in our web page, use Beautiful Soup's prettify function and check it.

```python
# To look at the underlying HTML of the web page
print(soup.prettify())
```

This is what we get from the prettify() function:

```html
<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of Asian countries by area – Wikipedia
  </title>
  <script>
```

If you visit the link and have a look at our Wikipedia page for the Asian countries, you can see there is a little more information about the country areas. The Wikipedia table is already set up, which makes our work easier.

Let's have a look for it in our prettified HTML. And there it is: an HTML <table> tag with a class identifier of "wikitable sortable". We will remember this class for future use.

If you scroll down in your output, you will see how the table is made up: rows begin and end with <tr> and </tr> tags. The first row of headers uses <th> tags, while each data row underneath uses <td> tags for every country. It is these <td> tags that we are going to tell Python to extract our data from.
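As an illustration of that row structure, here is a toy table in the same shape as the Wikipedia one (the HTML and the area figures are made up for the example, not the real page), iterated row by row:

```python
from bs4 import BeautifulSoup

# A toy table shaped like the Wikipedia one; values are illustrative
table_html = """
<table class="wikitable sortable">
  <tr><th>Country</th><th>Area</th></tr>
  <tr><td>Russia</td><td>13,129,142</td></tr>
  <tr><td>China</td><td>9,596,960</td></tr>
</table>
"""

soup = BeautifulSoup(table_html, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text() for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields []
        rows.append(cells)
print(rows)
```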

Before we go ahead, let's try out some Beautiful Soup functions to demonstrate how it captures and delivers data to us from the HTML website.

If we access the title attribute, Beautiful Soup will return the HTML tag for the title and the content within it.

```python
# To get the title of the page
soup.title
```

We can use this information to start preparing our attack on the HTML.

We know the data sits within an HTML table, so first we set Beautiful Soup off to retrieve all occurrences of the <table> tag within the page and store them in a variable called all_tables.

```python
# Use the 'find_all' function to bring back all instances of the 'table'
# tag in the HTML and store them in the 'all_tables' variable
all_tables = soup.find_all("table")
all_tables
```

Next, we grab just the table whose class is 'wikitable sortable'.

```python
# Use 'find' to pick out the table with the class 'wikitable sortable'
My_table = soup.find('table', {'class': 'wikitable sortable'})
My_table
```

Under the table class 'wikitable sortable', we have links with the country name as the title.

Now, to extract all the links within <a> tags, we use find_all().

```python
links = My_table.findAll('a')
links
```

From each link, we have to extract the title attribute, which holds the name of a country.

To do that, we create a list, Countries, so that we can extract the names of the countries from the links and append them to it.

```python
Countries = []
for link in links:
    Countries.append(link.get('title'))

print(Countries)
```

Now, we have to convert the list Countries into a pandas DataFrame to work with it in Python.

```python
import pandas as pd

df = pd.DataFrame()
df['Country'] = Countries
df
```

If you are interested in scraping data at high volume, you should consider using Scrapy, a powerful Python scraping framework, and also try to integrate your code with some public APIs.

The performance of data retrieval through an API is significantly higher than scraping webpages. For example, take a look at the Facebook Graph API, which can help you get data that is not shown on Facebook webpages. Consider using a database backend like MySQL to store your data when it gets too big.
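A lightweight sketch of that storage idea, using Python's built-in sqlite3 in place of MySQL and illustrative values standing in for the scraped Countries list:

```python
import sqlite3
import pandas as pd

# Illustrative data standing in for the scraped Countries list
df = pd.DataFrame({"Country": ["Russia", "China", "India"]})

# Store the DataFrame in a database table (sqlite3 here; MySQL would
# use a different connection object but the same to_sql call)
conn = sqlite3.connect(":memory:")
df.to_sql("countries", conn, index=False)

# Read it back to confirm the round trip
back = pd.read_sql("SELECT Country FROM countries", conn)
print(back["Country"].tolist())  # ['Russia', 'China', 'India']
```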

And that brings us to the end of our Beautiful Soup tutorial.

Hopefully, it gives you enough to get working on some scraping for your next project.

We've used Requests to fetch the URL and HTML data, Beautiful Soup to parse the HTML, and pandas to convert the data into a DataFrame for proper presentation.

You can find this tutorial notebook here.

If you have questions, please feel free to ask.

In the next tutorial, we are going to talk about APIs.

Feel free to contact me on LinkedIn.

References:

1. http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/
2. http://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/
3. https://github.com/stewync/Web-Scraping-Wiki-tables-using-BeautifulSoup-and-Python/blob/master/Scraping%2BWiki%2Btable%2Busing%2BPython%2Band%2BBeautifulSoup.ipynb
4. https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area
5. https://www.crummy.com/software/BeautifulSoup/
