Increase your scraping speed with Go and Colly! - The Basics

Let’s scrape Amazon to see how fast this can be.
But first, let’s learn about the basics.
Jérôme Mottet, Jun 10

Introduction

In this article, we’ll explore the power of Go(lang).
We’ll see how to create a scraper able to get basic data about products on Amazon.
The goal of this scraper will be to fetch an Amazon result page, loop through the different articles, parse the data we need, go to the next page, write the results in a CSV file and… repeat.
In order to do this, we’ll use a library called Colly.
Colly is a scraping framework written in Go.
It’s lightweight but offers many features out of the box, such as parallel scraping and proxy switching.
This article will cover the basics of the Colly framework.
In the next one, we’ll go into more detail and implement improvements and optimizations for the code we’ll be writing today.
Let’s inspect Amazon to determine the CSS selectors

Here is what Amazon’s results page looks like:

From this page, we would like to extract the name, the rating (stars), and the price of each product appearing on the results page.
We can see that all the information we need for each product is in this area:

With the help of the Google Chrome Inspector, we can determine that the CSS selector for those elements is “div.
Now, we just have to determine the selectors for the name, the stars, and the price.
All of those can be found thanks to the inspector.
Here are the results:

Name: span.
a-price > span.a-offscreen

These selectors are not perfect: as we’ll see later, some edge cases will require us to format the values we extract.
But for now, we can work with that.
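
As a rough idea of what that formatting could involve, here is a small sketch using only the standard library. It assumes the raw price text looks like "$1,299.99" and the raw rating text looks like "4.5 out of 5 stars"; both formats, and the parsePrice and parseStars helpers, are assumptions for illustration, not something confirmed by the page we inspected.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePrice converts a raw price such as "$1,299.99" into a float.
// The "$" prefix and "," thousands separator are assumed formats.
func parsePrice(raw string) (float64, error) {
	s := strings.TrimSpace(raw)
	s = strings.TrimPrefix(s, "$")
	s = strings.ReplaceAll(s, ",", "")
	return strconv.ParseFloat(s, 64)
}

// parseStars extracts the leading number from text such as
// "4.5 out of 5 stars" (an assumed format).
func parseStars(raw string) (float64, error) {
	fields := strings.Fields(raw)
	if len(fields) == 0 {
		return 0, fmt.Errorf("empty rating text")
	}
	return strconv.ParseFloat(fields[0], 64)
}

func main() {
	price, err := parsePrice(" $1,299.99 ")
	fmt.Println(price, err)

	stars, err := parseStars("4.5 out of 5 stars")
	fmt.Println(stars, err)
}
```

Both helpers return an error instead of a default value, so a product with a missing or oddly formatted field can be skipped or logged rather than silently recorded as zero.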
The selector of the results list itself is “div.
So the logic for our scraper will be: “For each product in the results list, fetch its name, stars, and price.”

We’ll handle pagination in another section.
For now, we can just see that the URL of the results page looks like this:

https://www.