Scraping Flight Data Using Python
The nerdy way to plan your next weekend trip
Gregor Hesse, Apr 23

Let’s say we want to plan our next weekend trip.
Plan is to go either to Milan or Madrid.
Point is we really don’t care, we’re just looking for the best option.
It’s a bit like the German soccer player Andy Möller once said: «Milan or Madrid, as long as it is Italy».
In a first step, we just look for a flight like we would usually do.
For this example we use Kayak.
Once we’ve entered our search criteria and set a few additional filters like «Nonstop», we can see that, interestingly, the URL in our browser has adjusted accordingly. We can actually dissect this URL into different parts: origin, destination, start date, end date, and a suffix that tells Kayak to look only for direct connections and to sort the results by price.
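Building such a URL from its parts might look as follows. This is a sketch: the exact path scheme and the query suffix are assumptions based on how the URL appeared at the time of writing, and the airport codes and dates are made up for illustration.

```python
# Illustrative values -- swap in your own origin, destination and dates.
origin = "BER"
destination = "MAD"
startdate = "2019-09-06"
enddate = "2019-09-09"

# Assumed URL scheme: path parts joined by "/", then a suffix that
# restricts results to nonstop flights and sorts them by price.
url = (
    "https://www.kayak.com/flights/"
    f"{origin}-{destination}/{startdate}/{enddate}"
    "?sort=price_a&fs=stops=0"
)
print(url)
```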
Now the general idea is to get the information we want (e.g. price, departure and arrival times) from the underlying html code of the website.
To do this, we mainly rely on two packages.
The first one is selenium, which basically controls your browser and automatically opens the website.
The second one is Beautiful Soup, which helps us turn the messy html code into a more structured and readable format.
From this «soup» we can later easily get the tasty bites we’re looking for.
So let’s get started.
First we need to set up selenium.
For this, we need to download a browser driver, e.g. ChromeDriver (make sure it corresponds to your installed version of Chrome), which we have to put in the same folder as our Python code.
Now we load a few packages, tell selenium that we want to use ChromeDriver, and let it open our URL from above.
Once the website has loaded, we need to find out how we can access the information that is relevant for us.
Let’s take, for example, the departure time. Using the inspect feature of our browser, we can see that the 8:55 pm departure time is wrapped in a span with a class called «depart-time base-time».
If we now pass the website’s html code to BeautifulSoup, we can specifically search for the classes we’re interested in.
The results can then be extracted with a simple loop.
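The parsing step might be sketched as below. To keep the snippet self-contained it runs on a hard-coded html fragment rather than a live page; the class name «depart-time base-time» is the one the inspect feature showed, while the fragment itself is made up.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page source.
html = """
<span class="depart-time base-time">8:55 pm</span>
<span class="arrival-time base-time">11:25 pm</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Search specifically for the class we inspected, then loop over the hits.
departures = [
    span.text.strip()
    for span in soup.find_all("span", class_="depart-time base-time")
]
print(departures)
```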
Since for each search result we get a set of two departure times, we also need to reshape the results into logical departure-arrival time pairs.
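The reshaping could be done with a simple slice over the flat list of times, as in this sketch (the times below are illustrative):

```python
# The scraped list alternates departure / arrival, one pair per leg.
times = ["8:55 pm", "11:25 pm", "6:10 am", "8:30 am"]

# Fold the flat list into logical (departure, arrival) pairs.
pairs = [(times[i], times[i + 1]) for i in range(0, len(times), 2)]
print(pairs)
```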
We use a similar approach for the price.
However, when inspecting the price element, we can see that Kayak likes to use different classes for its price information.
Therefore, we have to use a regular expression in order to capture all cases.
Also, the price itself is wrapped in further elements, which is why we need a few additional steps to get to it.
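A sketch of the regex approach, again on a hard-coded fragment. The two class names below are assumptions standing in for whatever variants Kayak serves; the point is that one compiled pattern catches all of them.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment with two different price classes.
html = """
<span class="price option-text">$108</span>
<span class="price-text">$254</span>
"""

soup = BeautifulSoup(html, "html.parser")

# A regex for class_ matches any span whose class contains "price".
price_spans = soup.find_all("span", class_=re.compile("price"))

# Strip everything that is not a digit to unwrap the number itself.
prices = [int(re.sub(r"[^0-9]", "", span.text)) for span in price_spans]
print(prices)
```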
Now, we put everything into a nice dataframe. And this is pretty much it.
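Assembling the pieces might look like this; the values are the illustrative ones from the sketches above, not real scraped results.

```python
import pandas as pd

# One row per flight option: the reshaped time pairs plus the prices.
df = pd.DataFrame({
    "departure": ["8:55 pm", "6:10 am"],
    "arrival": ["11:25 pm", "8:30 am"],
    "price": [108, 254],
})
print(df)
```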
We have scraped and put into shape all the information that was tangled up in the html code of our initial flight.
The heavy lifting is done.
To make things a bit more convenient, we can now wrap our code from above into a function and call that function by using different destination and starting day combinations for our three-day journey.
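Generating the destination/starting-day combinations is a one-liner with itertools. The function name scrape_flights below is a hypothetical stand-in for the wrapped-up code from above.

```python
from itertools import product

destinations = ["MAD", "MIL"]
start_days = ["2019-09-06", "2019-09-13", "2019-09-20"]

# Every (destination, start day) combination for our three-day journeys.
combinations = list(product(destinations, start_days))
print(len(combinations))

# for destination, startdate in combinations:
#     scrape_flights("BER", destination, startdate)  # hypothetical wrapper
```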
When we send several requests, Kayak might occasionally think that we’re a bot (and who can blame them). The best way to take care of this is to constantly change the browser’s user agent and to wait a bit between requests.
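The rotation and the waiting could be sketched as follows; the user-agent strings are shortened examples, and with selenium the chosen string would be passed on via ChromeOptions’ user-agent argument.

```python
import random
import time

# A small pool of example desktop user agents to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def pick_user_agent():
    """Wait a random moment, then return a randomly chosen user agent."""
    time.sleep(random.uniform(0.1, 0.3))  # short pause between requests
    return random.choice(USER_AGENTS)
```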
Our entire code would then look like this. Once we have specified all combinations and scraped the respective data, we can nicely visualize our results using a heatmap from seaborn. So it’s decided.
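The heatmap step could be sketched as follows. All prices except the $108 Madrid fare from the article are made up, and the off-screen Agg backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative results: one price per (destination, weekend) combination.
results = pd.DataFrame({
    "weekend": ["Sep 06", "Sep 13", "Sep 20"] * 2,
    "destination": ["MAD"] * 3 + ["MIL"] * 3,
    "price": [108, 135, 129, 142, 151, 147],
})

# Pivot into a destination x weekend grid, then plot it as a heatmap.
pivot = results.pivot(index="destination", columns="weekend", values="price")
sns.heatmap(pivot, annot=True, fmt="d")
plt.savefig("prices.png")
```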
Next stop: Madrid! For just $108, it is the cheapest option of the three weekends we picked in September.
Looking forward to eating some delicious tapas.