Webscraping Sci-Fi movies from IMDB with Python
Riley Predum
Jan 29

I recently watched Interstellar again, which I love, and it inspired me to go to IMDB and look for more Sci-Fi films, which I also love.
I was then inspired to do a little webscraping to get at their data and reviews on all these great movies, and find some good movies to watch!

Here’s how it went.
Looking at the website and trying to understand the parameters in the URI, I saw an argument start=1. When I went to the next page it became start=51. There are 50 movies per page, and after double-checking that nothing else in the URI differed, it was clear that this parameter controls which page is shown.
So this knowledge came in handy when I called get().
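Here is a minimal sketch of how that URL pattern might be built; it is not the original code. The base URL and the `genres` parameter are assumptions for illustration, and `build_page_url` is a hypothetical helper. Only the `start` parameter comes from the observation above. The resulting string is what would be passed to `get()`.

```python
from urllib.parse import urlencode

# Assumed search endpoint; the article does not show the exact URI.
BASE_URL = "https://www.imdb.com/search/title"


def build_page_url(start, genre="sci-fi"):
    """Return the search URL for the page whose first movie is number `start`."""
    # `genres` is an assumed parameter name; `start` is the one observed above.
    query = urlencode({"genres": genre, "start": start})
    return f"{BASE_URL}?{query}"


first_page = build_page_url(1)    # page with movies 1-50
second_page = build_page_url(51)  # page with movies 51-100
```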
I then initialized my variables and set the pages to cycle through those parameters previously mentioned.
Unfortunately, as my comment says, after 10,000 movies the pattern changes to after= followed by gibberish that is unique to each subsequent page.
So this will only go to 10,000 movies for now.
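The page offsets described above could be generated roughly like this. The constant and function names are hypothetical; the values 50 and 10,000 come from the text.

```python
MOVIES_PER_PAGE = 50
MAX_MOVIES = 10_000  # past this point the site switches from start= to after=


def page_starts(max_movies=MAX_MOVIES, per_page=MOVIES_PER_PAGE):
    """Return the start= value of every page up to the movie limit."""
    # Pages begin at movie 1, 51, 101, ... so step through in page-sized jumps.
    return list(range(1, max_movies + 1, per_page))


starts = page_starts()
```

With these values the list holds 200 offsets, from 1 through 9951.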
Next, I set up my for loop, which has two layers.
I will have to post it all in one chunk so the logic is all visible.
Here is the whole loop to scrape all the data.
For the page defined by the current iteration of the outer loop, the inner loop cycles through the 50 movies on that page and stores each variable of interest in the variable that is associated with it.
Once the inner loop is done, control returns to the outer loop to change the argument for the URI, thereby “clicking” to the next page.
And the cycle repeats.
This creates arrays of each variable containing each instance of that variable for each movie in the 10,000 movies that I was able to gather.
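The two-layer structure might look roughly like the sketch below. It is not the original loop: `fetch_movies` is a hypothetical callable standing in for the HTTP request plus HTML parsing, the three field names are assumptions, and `fake_fetch` returns placeholder records so the example runs without touching the network.

```python
def scrape_pages(starts, fetch_movies):
    """Collect one list per variable across all pages.

    `fetch_movies(start)` is a hypothetical callable that returns the parsed
    movies of one page as dicts; a real scraper would request and parse HTML.
    """
    titles, years, ratings = [], [], []
    for start in starts:                   # outer loop: one iteration per page
        for movie in fetch_movies(start):  # inner loop: each movie on the page
            titles.append(movie["title"])
            years.append(movie["year"])
            ratings.append(movie["imdb"])
    return {"title": titles, "year": years, "imdb": ratings}


def fake_fetch(start):
    """Stand-in for a real page fetch: two placeholder movies per page."""
    return [
        {"title": f"Movie {start + i}", "year": "(2014)", "imdb": 7.0}
        for i in range(2)
    ]


columns = scrape_pages([1, 51], fake_fetch)
```

A dict of equal-length lists like `columns` can be passed straight to `pandas.DataFrame`.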
At the end I put them into a dataframe and it was off to do some data cleaning!

With my newly created dataframe it was time to check it all out and make sure that nothing was weird before doing exploratory data analysis (EDA), which will come in part II of this article.
The first thing I noticed is that the year variable is in parentheses.
We don’t need that; we want an integer! The four digits of the year sit between the parentheses, ending one character before the end of the string, so a little index slicing will do the trick.
Then I call .astype to turn it into an int.
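As a plain-Python sketch of that step (the exact slice used in the original code is not shown, so the indices here are an assumption, and `parse_year` is a hypothetical helper):

```python
def parse_year(raw):
    """Turn a string such as '(2014)' into the integer 2014."""
    # Slice from the end: skip the closing ')' and keep the four characters
    # before it, so any prefix in front of the year does not matter.
    return int(raw[-5:-1])


year = parse_year("(2014)")

# On a dataframe column, the equivalent would be along the lines of
# df["year"].str[-5:-1].astype(int)   (column name assumed)
```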
I noticed that the IMDB score is on a scale of 10 and the metascore is on a scale of 100, so I needed to put them on the same scale. I stored the standardized IMDB score in a new column, which I called ‘n_imdb’.
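One way to do that rescaling, assuming the standardization is a simple multiplication by 10 (the original code is not shown, and `normalize_imdb` is a hypothetical helper):

```python
def normalize_imdb(score):
    """Rescale an IMDB score from the 0-10 scale to the 0-100 scale."""
    # round() rather than int(): a product that lands just below a whole
    # number because of floating-point error is not truncated downward.
    return round(score * 10)


n_imdb = normalize_imdb(8.6)
```

This sketch returns an integer directly, so the later integer conversion of ‘n_imdb’ is folded into the same step.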
Runtime was a string in the form ‘number min’ so I needed to get just the number, and remove that space and ‘min’.
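A small sketch of the runtime cleanup in plain Python (`parse_runtime` is a hypothetical helper; the original most likely used pandas string methods):

```python
def parse_runtime(raw):
    """Turn a string such as '169 min' into the integer 169."""
    # Remove the unit, then strip the leftover space before converting.
    return int(raw.replace("min", "").strip())


runtime = parse_runtime("169 min")

# On a dataframe column, the equivalent would be along the lines of
# df["runtime"].str.replace(" min", "").astype(int)   (column name assumed)
```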
I don’t need the original runtime feature anymore so I dropped that and then went to start fixing the ‘genre’ column, which had a ‘.’ before the string which needed to be removed.
Wrapping the data cleaning portion up, I turned ‘n_imdb’ into integer type, and noticed that the ‘genre’ column still had issues, namely 17 whitespace characters after each string, so I used .rstrip() on that. I then split on the commas because I wanted a list in each row for genre so that I could work with individual elements, since each movie’s genre was a string of three different genres.
I thought that would be useful for EDA to be able to be more granular.
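The genre cleanup could be sketched as follows. `parse_genres` is a hypothetical helper, and the sample string is built from the description above (a leading ‘.’ and 17 trailing spaces) rather than copied from the real data.

```python
def parse_genres(raw):
    """Turn '.Action, Adventure, Sci-Fi   ' into a list of genre names."""
    # Drop surrounding whitespace, then the leading '.' described above.
    cleaned = raw.strip().lstrip(".")
    # Split on commas and trim the space that follows each comma.
    return [genre.strip() for genre in cleaned.split(",")]


sample = ".Action, Adventure, Sci-Fi" + " " * 17
genres = parse_genres(sample)
```

Keeping a list per row makes it easy to count or filter by a single genre later on.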
Here’s the full data cleaning code:

You’ll notice that I wrote it to csv at the end.
That is because this was all one .
My goal was to run this on an instance of EC2 on AWS to let them do the heavy lifting.
By my calculations, it would have taken 48 min on my measly computer and I didn’t feel like waiting.
This concludes the first step of the analysis.
In part II, I will do EDA and start to think about machine learning if applicable.
Check out the repo on my GitHub for the code!

Happy coding,
Riley.