How I Built a System to Track 15,000,000+ Prices a Day

From sneakers to washing machines, learn how we built the brains behind our latest startup.

Written by Waldo’s Co-Founder & CTO Greg Lamp

Ben Sanders · May 28

[Image: Example of what kind of information we scrape from your favorite retailers]

Backstory

After selling off my first data science company, I decided to form a new startup that helps people effortlessly save money on the things they’ve recently purchased.
We call it Waldo, and it gets you cash back.
Waldo automatically facilitates price adjustments when items you’ve recently purchased go on sale within the company’s price protection window.
Building a Gameplan

Behind the brains of Waldo is a technical infrastructure capable of tracking more than 15 million online prices every day.
Today, we support more than 70 merchants, and that list grows every day.
Gotta Catch ‘Em All

From an engineering perspective, we knew that some domains would be harder to maintain due to the volume of products they sell online.
So we took a look at the distribution of retailers by product volume — and noticed the vast majority of domains had a modest number of products that need to be tracked.
[Image: Long-tail distribution of retailers by product volume]

We concluded the raw number of products wasn’t going to be an issue.
Instead, the challenge would be in maintaining so many unique domains.
The Method Behind the Madness

Thus, in order to track 15M+ product prices, our problems generally fell into one of the following two categories:

- Breadth — tracking prices on 70+ domains without our maintenance costs exploding.
- Frequency — the more frequently we get price updates, the more sales we’ll see and the faster we can process refunds.
Given this, the team decided to optimize for implementation time, reusability, and maintenance time over runtime efficiency and customization.
Giving Waldo a Brain

We started to build scrapers for each domain we wanted Waldo to support. While the HTML might differ from domain to domain, our approach was the same:

1. Get a list of all the categories for each domain.
2. Get all the products within each category.

This meant the scrapers we built were all quite similar, and we were able to achieve our objective of code reusability, which meant less cognitive overhead for the team.
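The two-step crawl above can be sketched with nothing but the standard library. The HTML snippets, class names, and link structure below are hypothetical examples, not taken from any real retailer, and our production stack uses Scrapy rather than raw HTMLParser:

```python
# Minimal sketch of the two-step crawl pattern: categories first, then
# products within each category. All markup here is made up.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

# Step 1: get a list of all the categories (e.g. from the site nav)
nav_html = '<nav><a href="/shoes">Shoes</a><a href="/bags">Bags</a></nav>'
categories = extract_links(nav_html)

# Step 2: get all the products within each category page
category_html = '<ul><li><a href="/p/123">Sneaker</a></li><li><a href="/p/456">Boot</a></li></ul>'
products = extract_links(category_html)

print(categories)  # ['/shoes', '/bags']
print(products)    # ['/p/123', '/p/456']
```

In practice each step also fetches the pages over HTTP and handles pagination, but the shape of every scraper stays the same, which is what makes them cheap to maintain.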
[Image: Navs are an easy way to find all the categories for any given domain]

Python

We decided on Python as our programming language for Waldo’s scraping stack. There were two main considerations:

“Top to Bottom, Left to Right” Centric

We originally toyed around with the idea of using Node.js (this is what we use for our web app). While it would’ve been nice to have a single programming language company-wide, we found it tedious and error-prone when dealing with callbacks and async operations. We tried using Python and the fit felt right.
Libraries

Python has an incredible scraping ecosystem, with libraries like requests, BeautifulSoup, and our favorite, Scrapy.
When it comes to code reusability, Scrapy really shines.
It has a great set of defaults and a rich set of tooling allowing fairly junior developers to build effective scrapers in no time.
It also has built-in support for XPaths, which we use to parse and extract data from each product page.
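As a rough sketch of what XPath-based extraction from a product page looks like, here is a stdlib-only example; the markup, class names, and paths are hypothetical, and in production Scrapy’s selectors support the full XPath language rather than ElementTree’s limited subset:

```python
# Illustrative XPath-style extraction from a (made-up) product page.
import xml.etree.ElementTree as ET

product_page = """
<div>
  <h1 class="product-name">Leather Wallet</h1>
  <span class="price">128.00</span>
</div>
"""

root = ET.fromstring(product_page)

# ElementTree supports a small XPath subset (paths plus [@attr='value']
# predicates); Scrapy's response.xpath() gives you the whole language.
name = root.find(".//h1[@class='product-name']").text
price = float(root.find(".//span[@class='price']").text)

print(name, price)  # Leather Wallet 128.0
```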
Scrapinghub

Another benefit of Scrapy is that it’s professionally supported by the folks at Scrapinghub.
Not only does the team contribute new features to the open source Scrapy project, but they also manage and run their own SaaS product based on Scrapy.
Scrapinghub handles all of the infrastructure for our scrapers, everything from scheduling to auto-scaling. It allows our team to focus on the business application of our scrapers instead of babysitting jobs.
XPath Helper

And last but surely not least is XPath Helper, a Chrome plugin that allows you to test out XPaths.
XPaths can get a bit unruly, so having a sandbox/console-style tool really helped with debugging our scrapers.
[Image: Type your XPath into the text box on the top right and see the results as you type]

Quality Control

Our scrapers aren’t really worth much if we can’t monitor when they’re down. We’ve found that domains occasionally change layouts, which in turn breaks our scrapers’ ability to detect product prices.
The two most common issues we see are:

1. Data integrity issues (e.g. duplicate values, missing product prices)
2. Missing data (i.e. a product is on the domain, but not in our database)

Data Integrity

To fight data integrity issues, we do a lot of checksums.
Checksums are simple queries that count the records that don’t look quite right.
For example, checking to see if there are duplicate SKUs in the database.
We have a battery of checksums that we run every hour to validate everything looks good.
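As an illustration, a couple of such checks might look like the following; the schema, table, and values are made up for the example, not Waldo’s actual database:

```python
# Two example "checksum" queries against a toy products table:
# one counts duplicate SKUs, one counts rows with no price.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)", [
    ("SKU-1", 19.99),
    ("SKU-1", 21.50),   # duplicate SKU -- should be flagged
    ("SKU-2", None),    # missing price -- should be flagged
    ("SKU-3", 5.00),
])

# Checksum 1: SKUs that appear more than once
dupes = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT sku FROM products GROUP BY sku HAVING COUNT(*) > 1)"
).fetchone()[0]

# Checksum 2: rows missing a price
missing = conn.execute(
    "SELECT COUNT(*) FROM products WHERE price IS NULL"
).fetchone()[0]

print(dupes, missing)  # 1 1
```

Any non-zero count pages the team; the point is that each check is cheap enough to run every hour across the whole database.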
[Image: In this example, you can see there’s an issue with anthropologie.com missing price changes]

Missing Data

Since we can only estimate how many products are on each domain, we have no way of knowing whether we’ve scraped the entire catalog or not.
As a result, we occasionally miss products.
Typically these products are stashed away in some obscure category that we didn’t think to include in our scraper.
A good example is the way sites treat on-sale items.
Let’s say a wallet at Coach goes on sale.
This might trigger it to be removed from the Accessories category and added to the Sale category, or even trickier, the Clearance category (yes, Sale and Clearance are most often totally separate).

[Image: In this case, “Summer Sale” is its own entirely separate category]

To combat this, we have a manual QA step where we sample products from the domain and validate that they’ve made it into our database. It’s low-tech but highly effective.
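A minimal sketch of that sampling check, assuming we can list product URLs seen on the domain and the set already in our database (all names here are hypothetical):

```python
# Draw a random sample of products seen on the domain and flag any
# that never made it into our database.
import random

def sample_missing(domain_products, db_products, k=3, seed=0):
    """Return the sampled products that are absent from the database."""
    rng = random.Random(seed)                 # seeded for reproducible audits
    sample = rng.sample(sorted(domain_products), k)
    return [p for p in sample if p not in db_products]

on_site = {"/p/wallet", "/p/belt", "/p/scarf", "/p/tote"}
in_db = {"/p/wallet", "/p/belt", "/p/tote"}

print(sample_missing(on_site, in_db, k=4))  # ['/p/scarf']
```

In the real workflow a human does the sampling by browsing the site, but the comparison step is the same: anything on the domain that isn’t in the database points at a category the scraper doesn’t cover yet.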
You’d be surprised how good humans are at finding products that a scraper might have missed.

Before You Go

I hope you found this post helpful.
If you’re interested in seeing how we’ve implemented this tech, feel free to check us out at Waldo.
Or if you’re looking for more resources to get started on your own project, check out these handy links:

- Scrapinghub
- Puppeteer — headless Chrome that’s useful for scraping complex sites
- Readypipe — competitor to Scrapinghub
- How to scrape websites with Python and BeautifulSoup