Web Scraping for Financial Analysts: Beautiful Soup, Scrapy, Selenium & Twitter API
Senthil E
May 1

Introduction
I was learning about web scraping recently and thought of sharing my experience in scraping with Beautiful Soup, Scrapy, and Selenium, and in using the Twitter API and pandas DataReader.
Web scraping is a fun and very useful tool.
Python makes web scraping much easier: with fewer than 100 lines of code you can extract the data.
Web scraping is an important skill for data analytics professionals to acquire.
For example, we can scrape financial information, macroeconomic information, or any other information used by equity research analysts, venture capitalists, treasury managers, fund managers, hedge fund managers, etc.
By scraping web traffic data from Alexa.com, Goldman Sachs Asset Management was able to identify a sharp rise in visits to the Home Depot website, which helped the asset manager buy Home Depot stock well in advance.
Tracking FlightAware.com helps in finding out the travel patterns of CEOs, which in turn helps in detecting M&A activity.
Web scraping is very important in finance, since tons of financial information is available on the web.

What is Web Scraping?
Web scraping is a technique to extract information from websites and process it.
Let's see how simple and easy it is to do web scraping.
I will go through the following libraries and APIs:
- Beautiful Soup
- Scrapy
- Selenium
- Twitter API to extract tweets
- Pandas DataReader to read from Google Finance
We will be using Jupyter notebook.
Please check the Jupyter installation link and install it. To know more about Jupyter, see the Jupyter documentation.

Beautiful Soup:
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Installing BS4: run the following in the command prompt.
pip install beautifulsoup4
pip install requests
I had already installed both, so pip reports "Requirement already satisfied".
Once BS4 is installed, open the Jupyter notebook.
Just for demo purposes I will scrape the data from the NASDAQ website:
Technology Companies: Find Technology Companies and a complete list of NASDAQ, NYSE, and AMEX listed companies using the Company List tool at…
We will be scraping the following info:
- Name
- Symbol
- Market Cap
- Country
- IPO Year
- Subsector
The web page looks like below.
[Screenshot: the NASDAQ Technology Companies list page]
Check the Beautiful Soup documentation.
The steps are:
- Select the website URL to scrape.
- Finalize the information to be scraped from the website.
- Send the GET request.
- Inspect the website (right-click in Google Chrome).
- Parse the HTML with Beautiful Soup.
- Select the data needed and append it to a list.
- Download the scraped data to a CSV file and store it locally.
- Move the data to a pandas DataFrame, or load the CSV file into a DataFrame, and do further analysis and visualization in pandas.
The steps are explained in the code below.
Now open the Jupyter notebook and start coding.
Import all the necessary libraries.
Get Request: First we need to download the web page.
To do this we will use a GET request.
from requests import get
url = "Give the URL you decided to scrape the data from"
response = get(url)
For more info on the GET request, see the Requests library documentation.
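As a minimal sketch of this step, assuming the `requests` library, the download could be wrapped as shown below. The URL is a hypothetical placeholder (example.com is not the site scraped in this post), and `fetch_html` is an illustrative helper name. The prepared request is built without being sent, just to show the final URL a GET would hit.

```python
import requests

# Hypothetical placeholder; substitute the page you decided to scrape.
URL = "https://example.com/companies"


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


# Build (but do not send) a GET request to inspect the URL it would fetch.
prepared = requests.Request("GET", URL, params={"page": 1}).prepare()
print(prepared.method, prepared.url)
```

Calling `fetch_html(URL)` performs the actual download. A timeout and `raise_for_status()` are optional, but they make a hung or failed request visible instead of silently returning an error page.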
Parsing the HTML: We will create a new BeautifulSoup object from the response above, using html.parser as the parser.
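Here is a small sketch of the parsing step. The HTML string is made up for illustration and stands in for the text of a real response. It assumes Python's built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text; this markup is invented for the example.
html = """
<html><head><title>Technology Companies</title></head>
<body><h1>Company List</h1><p class="note">Sorted by market cap</p></body></html>
"""

# The second argument names the parser; "html.parser" ships with Python.
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.get_text(strip=True)
note = soup.find("p", class_="note").get_text(strip=True)
print(page_title)
print(note)
```

Once the `soup` object exists, every later step (finding the table, rows, and cells) is a search against it rather than against the raw text.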
In Chrome, we can quickly find selectors for elements as follows:
- Right-click on the element, then select “Inspect” in the menu. Developer tools opens and highlights the element we right-clicked.
- Right-click the code element in developer tools, hover over “Copy” in the menu, then click “Copy selector”.
Check the documentation for the Chrome inspect tool, and an HTML/XML reference for background on the markup.
In our case the whole list is saved in the table “CompanylistResults”.
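A copied selector can be passed straight to Beautiful Soup. The sketch below assumes the table carries the id “CompanylistResults”, as described above. The surrounding markup and the company values are invented for the example.

```python
from bs4 import BeautifulSoup

# Invented markup: two tables, only one with the id we care about.
html = """
<div id="main">
  <table id="CompanylistResults">
    <tr><td>Example Corp</td><td>EXM</td></tr>
  </table>
  <table id="other"><tr><td>ignored</td></tr></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "#CompanylistResults" is the kind of CSS selector "Copy selector"
# produces for an element that has an id attribute.
table = soup.select_one("#CompanylistResults")
first_row_cells = [td.get_text(strip=True) for td in table.select("tr td")]
print(first_row_cells)
```

`select_one` returns the first match or `None`, so it is worth checking the result before using it on a page whose layout may change.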
It is straightforward to scrape the data from the table.
The elements of a table are:
<table>: The HTML <table> element represents tabular data, that is, information presented in a two-dimensional table comprised of rows and columns of cells containing data.
<td>: The HTML <td> element defines a cell of a table that contains data. It participates in the table model.
<tr>: The HTML <tr> element defines a row of cells in a table. The row's cells can then be established using a mix of <td> (data cell) and <th> (header cell) elements.
The HTML <tr> element specifies that the markup contained inside the <tr> block comprises one row of a table, inside which the <th> and <td> elements create header and data cells, respectively, within the row.
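To make the difference between header and data cells concrete, here is a short sketch on an invented three-row table. Rows that contain <th> cells are treated as the header; the other rows are treated as data.

```python
from bs4 import BeautifulSoup

# Invented table: one header row (<th>) and two data rows (<td>).
html = """
<table>
  <tr><th>Name</th><th>Symbol</th></tr>
  <tr><td>Example Corp</td><td>EXM</td></tr>
  <tr><td>Sample Inc</td><td>SMP</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

headers = []
data_rows = []
for tr in soup.find_all("tr"):
    th_cells = tr.find_all("th")
    if th_cells:
        # A row made of <th> cells supplies the column names.
        headers = [th.get_text(strip=True) for th in th_cells]
    else:
        data_rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(headers)
print(data_rows)
```

Header text collected this way can later serve as column names, though real pages do not always mark their header row with <th>.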
Check out the Common HTML Elements reference for more info.
The whole list is stored in the table “CompanylistResults”. The steps are:
- Find the table with ID = “CompanylistResults”.
- Select all the ‘tr’ elements (all the rows).
- Loop over the ‘tr’ elements and find all the ‘td’ elements in each (a td is a cell).
Based on the HTML, the data is stored inside the <tr> elements, which hold the row information.
Each row has corresponding <td>...</td> elements, which hold the cell data.
I append the result to a list.
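The steps above can be sketched as follows. The table id comes from the post. The function name `extract_rows`, the sample markup, and the company values are assumptions made up for illustration, and the real page may nest its cells differently.

```python
from bs4 import BeautifulSoup


def extract_rows(html, table_id="CompanylistResults"):
    """Return one list of cell texts per data row of the table with this id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []  # the page has no such table
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
        if cells:  # header rows hold only <th>, so they yield no cells
            rows.append(cells)
    return rows


# Invented sample with the six fields listed earlier in the post.
sample_html = """
<table id="CompanylistResults">
  <tr><th>Name</th><th>Symbol</th><th>Market Cap</th>
      <th>Country</th><th>IPO Year</th><th>Subsector</th></tr>
  <tr><td>Example Corp</td><td>EXM</td><td>$1.5B</td>
      <td>United States</td><td>2019</td><td>Software</td></tr>
  <tr><td>Sample Inc</td><td>SMP</td><td>$250M</td>
      <td>Israel</td><td>n/a</td><td>Semiconductors</td></tr>
</table>
"""

company_rows = extract_rows(sample_html)
for row in company_rows:
    print(row)
```

Returning an empty list when the table is missing keeps a multi-page loop running even if one page fails to render the table.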
When I print the list I get the following.
[Screenshot: the printed list of scraped rows]
In this case we scraped only the first page.
I see there are 14 pages.
To scrape multiple pages we can add the logic below:
- Initialize a variable called pages with a range covering 1 through 14 (in Python, range(1, 15), since the end value is excluded).
- Loop over each page and put the page value into the URL:
url = 'https://www.…aspx?industry=Technology&sortname=marketcap&sorttype=1&page=' + str(page)
For the 2nd page the URL will be:
url = 'https://www.…aspx?industry=Technology&sortname=marketcap&sorttype=1&page=2'
The whole extraction logic is coded inside the for loop, so it runs once for each page.
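A hedged sketch of the paging logic follows. The base URL is a hypothetical placeholder, not the address used in this post, and `page_url` is an illustrative helper name. The query parameters are the ones shown in the URL above.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; it stands in for the real listing URL.
BASE_URL = "https://example.com/companies.aspx"


def page_url(page: int) -> str:
    """Build the listing URL for one page number."""
    query = {
        "industry": "Technology",
        "sortname": "marketcap",
        "sorttype": 1,
        "page": page,  # urlencode converts the integer to text for us
    }
    return f"{BASE_URL}?{urlencode(query)}"


# range(1, 15) yields 1..14; range(1, 14) would stop at page 13.
page_urls = [page_url(page) for page in range(1, 15)]

print(len(page_urls))
print(page_urls[1])
```

Inside the loop, the same download and extraction logic used for the first page runs once per URL, with each page's rows added to one shared list. Building the query with `urlencode` avoids the type error that plain string concatenation with an integer page number would raise.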
After extracting the data, you can now:
- Load the data into a pandas DataFrame.
- Assign column names such as Name, Symbol, etc. to the DataFrame.
- Download the DataFrame to a CSV file.
- Do the analysis and visualization.
The code is below.
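These four steps might look like the sketch below. The rows are invented sample data, not real NASDAQ figures. The file name and the `parse_market_cap` helper are hypothetical, and the "$1.5B" / "$250M" text format for market cap is an assumption about how the site displays it.

```python
import tempfile
from pathlib import Path

import pandas as pd

COLUMNS = ["Name", "Symbol", "Market Cap", "Country", "IPO Year", "Subsector"]

# Invented rows standing in for the scraped list.
scraped_rows = [
    ["Example Corp", "EXM", "$1.5B", "United States", "2019", "Software"],
    ["Sample Inc", "SMP", "$250M", "Israel", "n/a", "Semiconductors"],
    ["Demo Ltd", "DMO", "$3.2B", "United States", "2004", "Software"],
]


def parse_market_cap(text):
    """Convert text such as '$1.5B' or '$250M' to dollars (assumed format)."""
    multipliers = {"M": 1e6, "B": 1e9, "T": 1e12}
    cleaned = text.replace("$", "").replace(",", "").strip()
    if not cleaned or cleaned.lower() == "n/a":
        return float("nan")
    suffix = cleaned[-1].upper()
    if suffix in multipliers:
        return float(cleaned[:-1]) * multipliers[suffix]
    return float(cleaned)


# Load the rows and assign the column names in one step.
df = pd.DataFrame(scraped_rows, columns=COLUMNS)
df["Market Cap USD"] = df["Market Cap"].map(parse_market_cap)

# Save to CSV; index=False keeps the row index out of the file.
csv_path = Path(tempfile.gettempdir()) / "nasdaq_technology.csv"
df.to_csv(csv_path, index=False)

# Simple analysis: largest companies and a count per country.
top = df.nlargest(2, "Market Cap USD")[["Name", "Market Cap USD"]]
by_country = df["Country"].value_counts()
print(top)
print(by_country)
```

Converting market cap to a number first matters because sorting the raw strings would rank "$250M" above "$1.5B". For charts, `by_country.plot(kind="bar")` works once matplotlib is installed.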
The downloaded CSV file looks like below.
[Screenshot: the downloaded CSV file]
Top 10 Market Cap Stocks in NASDAQ:
[Figure: Top 20 Market Cap Stocks in NASDAQ]
NASDAQ Listed Stocks Countries:
[Figure: listed stocks by country]
Number of IPOs per Year:
[Figure: IPO Year]
Sector-wise number of companies listed in NASDAQ:
[Figure: sector-wise breakup]
Companies that went IPO in 2019:
(I was surprised not to see Lyft in the list. I checked the NASDAQ website and Lyft is not in the list; I think they haven't updated their database. Pinterest and Zoom are there.)
Describe shows the following:
[Screenshot: output of describe]
Pandas profiling shows more info.
An alternative to web scraping is APIs.
Popular APIs: check out this GitHub repository for useful APIs.
In the next post I will cover Scrapy, Selenium, the Twitter API, and pandas DataReader.
Thank you for reading my post.