My journey of crawling websites
Low Wei Hong · Jun 16
A personal sharing from someone who became a data scientist
What do you think about scraping or crawling websites? A lot of people view this skill as a mere automation tool, or dismiss it as a low-end skill.
For me, it is more like a war, except that this war happens on the internet.
This is especially true when you crawl a particular website routinely. The first few times you may win, and you will celebrate how easy the website is to scrape, but you may not realise that on the other side there are people tracking your bot's suspicious activity and trying to block it from crawling the website.
People often overlook the importance of retrieving a good data source and instead put more emphasis on building a more accurate machine learning model.
There is an acronym in machine learning, GIGO, short for "garbage in, garbage out".
Therefore it is important to realise that if you can retrieve a good data source, and you are already equipped with solid EDA and modelling skills, you will be able to build a better machine learning model.
Back to the topic: I am going to share with you my journey from scraping to crawling websites.
To give you some background: I graduated from Nanyang Technological University in Singapore with a degree in Mathematics and Economics. Not coming from a technical background, I found it harder to learn technical skills, but if you work hard, you will eventually be good at them.
First Experience in Scraping for Automation
It started with a part-time job at the Institute of Statistic at Nanyang Technological University while I was still studying, where I had to visit a website and copy and paste each page into an Excel file.
Basically, I was required to get the world rank and score of each university from this website.
After each shift of that part-time job, my eyes always felt very tired.
That is the reason I started to scrape the website.
Nothing fancy: I just used the Python library Requests to fetch the website and BeautifulSoup to parse the HTML contents.
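To give a feel for how simple this combination is, here is a minimal sketch of the parsing side. The URL, table layout, and column names below are placeholders I made up for illustration, not the actual ranking site; a hard-coded HTML sample stands in for the fetched page so the BeautifulSoup step is easy to follow.

```python
from bs4 import BeautifulSoup

# In practice the page is fetched first with Requests, roughly:
#   html = requests.get("https://example.com/world-rankings").text
# Here a small hard-coded sample stands in for that fetched page.
html = """
<table>
  <tr><th>Rank</th><th>University</th><th>Score</th></tr>
  <tr><td>1</td><td>University A</td><td>98.5</td></tr>
  <tr><td>2</td><td>University B</td><td>97.1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr")[1:]:  # skip the header row
    rank, name, score = (td.get_text(strip=True) for td in tr.find_all("td"))
    rows.append({"rank": int(rank), "university": name, "score": float(score)})

print(rows)
```

From a list of dicts like this, writing out an Excel or CSV file is one more line, which is exactly what replaced the copy-and-paste work.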
The result was great, and I am really grateful for it.
The scraper not only saved my eyes from being so tired but also improved my work efficiency.
For a first attempt at scraping, I found it really easy to scrape a website using these two Python packages.
If you are interested in this scraper, you can visit my GitHub repo for more information.
Experience in Crawling for a Data Science Project
Machine learning is my interest.
It has sparked my interest since my first internship at Dentsu Aegis Network's Global Data Innovation Centre.
Given the chance to witness machine learning projects in digital marketing, I was truly impressed by their power.
Therefore, wanting to become a data scientist, I told myself to do more projects involving machine learning.
I decided to do a project on predicting rental prices based on certain factors, for example the distance between the MRT station and the rental unit, the size of the room, and the number of bathrooms in the unit.
So I decided to crawl the PropertyGuru website, one of the most popular websites for finding a rental unit in Singapore.
It is a dynamic website, which required me to build an interactive bot, so I chose the Python packages Selenium and BeautifulSoup to crawl it.
At first I thought I would be able to build a scraper and retrieve the data quite easily, but the website implemented a Completely Automated Public Turing test to tell Computers and Humans Apart (Captcha).
That is where I realised crawling is not easy.
It requires a deep understanding of the particular website before you can retrieve the data.
After putting much effort into understanding the possible reasons for being blocked by the website, I came up with a way to mimic human behaviour, and finally it worked like a charm.
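Mimicking human behaviour can mean many things, but two of the most common ingredients are randomised pauses between page loads and rotating browser-like User-Agent headers. The sketch below shows only that general idea, not the specific trick I used on this site; the helper names and the User-Agent pool are my own illustrative choices.

```python
import random
import time

# A pool of browser-like User-Agent strings to rotate through,
# so consecutive requests do not all look identical.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def human_pause(low=2.0, high=6.0):
    """Sleep for a random, human-looking interval and return its length."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

def next_headers():
    """Pick a random User-Agent for the next request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

# Between page loads the crawler calls human_pause() and sends
# the headers returned by next_headers() with each request.
```

A fixed one-second sleep is easy for the other side to spot; a randomised interval in a plausible range is much closer to how a person actually browses.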
Long story short, I was then able to apply machine learning models for prediction, and the results looked pretty good after using EDA to create multiple features.
This would not have been possible if I had not been able to get accurate, clean data for my machine learning model.
Experience in Crawling during Work
After I graduated, I started my first job in a Business Intelligence position at Shopee.
I was responsible for crawling around 120k items daily for competitor analysis.
This is where my crawling skills really improved.
My bot once again got blocked by a Captcha, and that is the reason I learned a new Python package for crawling: Scrapy.
It is definitely a great package for crawling.
For a detailed comparison between the Scrapy and Selenium packages, feel free to visit this article: https://hackernoon.com/scrapy-or-selenium-c3efa9df2c06
Yeah, I managed to solve it, but this time the things I learnt most are listed as follows:
Maintain a database of past data so that it can be used for analysis.
Build a dashboard to monitor the performance of several crawlers, so that I can amend the code as fast as possible when problems occur.
Techniques to bypass anti-scraping measures or to create a more efficient crawler; for more information you can visit this article: https://hackernoon.com/5-tips-on-creating-an-effective-web-crawler-85a82967709a
Techniques to retrieve sensitive data, which may require you to use the POST method.
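On that last point: some data never appears in the page URL at all and only comes back when the browser submits a form or calls an API with POST. A rough sketch of that idea with Requests is below; the endpoint and payload are entirely made up for illustration, and the request is only built and inspected here, not actually sent.

```python
import requests

# Hypothetical endpoint and form payload -- a real site's API would differ.
url = "https://example.com/api/listings/search"
payload = {"district": "D15", "bedrooms": "2"}

# Build the POST request without sending it, so we can inspect
# exactly what the server would receive.
prepared = requests.Request("POST", url, data=payload).prepare()
print(prepared.method)  # POST
print(prepared.body)    # district=D15&bedrooms=2

# To actually send it, one would do something like:
#   response = requests.Session().send(prepared)
#   data = response.json()
```

Finding the right endpoint and payload usually means watching the network tab of the browser's developer tools while using the site by hand, then reproducing that request in code.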
Final Thoughts
I am currently working as a Data Scientist, and what I can tell you is that crawling is still very important.
I really hope this article helps and inspires you to solve some of the problems you face when web crawling gets difficult.
Thank you for reading this post.
Feel free to leave comments below on topics you would like to know more about.
I will publish more posts about my experiences and projects in the future.
About the Author
Low Wei Hong is a Data Scientist at Shopee.
His experience mostly involves crawling websites, creating data pipelines and implementing machine learning models to solve business problems.
He provides crawling services that can deliver the accurate and clean data you need.
You can visit this website to view his portfolio and to contact him for crawling services.
You can connect with him on LinkedIn and Medium.