We may be looking at some data, but it’s not very useful in its current format: some tweets have been truncated, and we only have data from accounts we already follow.
If we want to build a structured dataset for analysis, we’ll have to dive deeper into the API call’s output… put on your hard hat!

Step 4: Inspecting a tweet’s json

In a new cell, type and execute public_tweets[0] to view the json associated with the first tweet. Enjoy the output.
A messy output, hard to make sense of

This is no way to work through our data; try it for a few minutes and you’ll be rereading lines and giving yourself a headache in no time.
If you’ve tried scraping before, you’ve probably used pprint (pretty print) to format data like this in a more readable way.
Unfortunately, pretty print can’t help us here, because public_tweets is not a fundamental Python data type: its type is tweepy.Status, and the pprint module is not compatible with it.
We need to make the tweepy output more readable so we can figure out what code we need to write to extract the data we want.
We can do this as follows. Now you will be looking at something more structured:

A clean output, a lot easier to parse

With this view, we can slowly work through the dictionaries and lists to determine what information is available and which keys are required to access it.
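One way to get that cleaner view: every tweepy Status object keeps the raw response dict in its internal _json attribute, which the standard json module can pretty-print. A minimal sketch (the sample dict below stands in for public_tweets[0]._json so the snippet runs on its own):

```python
import json

# stand-in for public_tweets[0]._json — in a live session you would
# pass the real attribute straight to json.dumps
sample = {
    'id': 1,
    'user': {'screen_name': 'dril', 'followers_count': 1000000},
    'full_text': 'example tweet text',
}

print(json.dumps(sample, indent=4, sort_keys=True))
```

The indent and sort_keys arguments are what turn the one-line blob into the structured, alphabetised view described above.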
Step 5: Parsing out the data

If you just want primary data on tweets (the text, id, user id), accessing the information is relatively easy and reliable. As we already saw, we can access the text of a tweet with status.text when using the home timeline API call.
Depending on the nature of your project and the data you need, you will need to choose the appropriate API call to collect it.
The rest of this article focuses on the user_timeline call, which (for a standard account) can retrieve up to a user’s 3200 most recent tweets, as long as the profile is not locked.
Here’s where Tweepy gets a bit cumbersome.
The json structure returned by an API call changes depending on the characteristics of the tweet (think regular tweet, retweet or quote tweet).
Instead of providing a consistent format (where keys point to empty lists if no information is available), some keys are only present when there is data to point to.
The easy way to overcome this is to settle for primary tweet information, but if you care about retweets and quotes then your data will be incomplete.
If we want to build a rich dataset, we’ll need to use try and except statements to handle this inconsistency.
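In isolation, the pattern looks like this (the SimpleNamespace objects are stand-ins for real Status objects, just to make the snippet self-contained):

```python
from types import SimpleNamespace

def safe_retweet_text(status):
    # retweeted_status only exists when the tweet is a genuine retweet,
    # so guard the attribute access rather than assuming the key is there
    try:
        return status.retweeted_status.full_text
    except AttributeError:
        return None

# stand-ins for a retweet and a plain tweet
retweet = SimpleNamespace(retweeted_status=SimpleNamespace(full_text='the original text'))
plain = SimpleNamespace(full_text='just a regular tweet')

print(safe_retweet_text(retweet))  # the original text
print(safe_retweet_text(plain))    # None
```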
A good way to collect your data is by building a class that makes an API call and parses the data in one go.
A few notes on this code:

- Everything in the mined dictionary without a try statement is accessible regardless of the tweet type.
- tweet_mode = 'extended' swaps the text index for full_text, and prevents a primary tweet longer than 140 characters from being truncated.
- Unfortunately, retweets are still truncated; we can access the full retweet text with retweeted_status.full_text. Here we use a try statement because this attribute is only present when the tweet is a bona fide retweet.
- Location data contains coordinates, city, and country. Depending on what you want, you will need to parse this further, but it’s pretty easy.
- result_limit and max_pages are multiplied together to get the number of tweets called.
- Retweets and quote tweets are included in your 3200-tweet count regardless of whether you collect them or not (you won’t get more primary tweets by turning this setting off).
- If you don’t provide a username when making a call using the class, you’ll be scraping tweets from the one and only wint. You’re welcome!

Make sure you take time to review this code against your version of the structured json output we made earlier, plus two additional outputs, one for each of the other tweet types (retweet and quote).
By doing this you will gain a better understanding of how the data is accessed and be ready to create your own class for different calls or modify this class to collect additional data.
Step 6: Pick a project, identify the data you want

This might have been step one for you; it might even be what led you to this article.
You probably won’t need much help here, but a few tips never hurt.
Before starting a project find out if there are any existing resources available.
Like every original data scientist, I picked up Tweepy with the intention of analysing politicians’ tweets. With a quick Google search I found a list of Twitter handles for every UK Member of Parliament with a Twitter account, which saved me a lot of grunt work.
Be clear about your objectives. If your goal is to learn, make sure you look at other projects: it’s good for motivation and ideas, and can help point you in the right direction if you’re stuck.
If originality is your goal, again, look at other projects; your best idea may be an addition to an existing project. Embrace the open-source philosophy.
Once you have established whose tweets you want to collect, gather the handles into a list of strings (do not include the @ preceding every Twitter username).
Step 7: Collect that data!

Start your session by importing the libraries you need, then copy in the code for the class, substituting your own API credentials.
Instantiate your miner. Remember that the default number of tweets collected per page is set to 20; you can alter this when you instantiate. (The number of pages, i.e. how many multiples of result_limit are collected, is specified when making the call.) Now you’re all ready to make calls and have the data collected in a format you can use.
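Instantiation and a first call might look like this (a sketch: it assumes the TweetMiner class from Step 5 is defined in your session, and the calls are commented out because they need real credentials):

```python
# placeholder credentials — substitute your own from the Twitter developer portal
twitter_keys = {
    'consumer_key': 'YOUR_CONSUMER_KEY',
    'consumer_secret': 'YOUR_CONSUMER_SECRET',
    'access_token': 'YOUR_ACCESS_TOKEN',
    'access_token_secret': 'YOUR_ACCESS_TOKEN_SECRET',
}

# miner = TweetMiner(keys=twitter_keys, result_limit=200)
# wint_df = miner.mine_user_tweets(user='dril', max_pages=16)  # 200 * 16 = 3200 tweets
```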
And there you have it.
With a data frame of 3200 wint tweets, you’re just a bit of EDA away from understanding the meaning of life!

The last thing to remember is that there is a limit on API calls; for a standard user this is 900 requests every 15 minutes.
Don’t worry, one tweet isn’t one request, but it might be difficult to know when you’re about to go over.
For making multiple calls at once, try this: start making calls without the sleep timer; then, if you get an error related to your call limit, play around with the sleep timer and the counter if-statement (which initiates the timer every 25 calls) until you get something that works for you.
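A loop along those lines might look like this (a sketch: mine_one stands in for whatever single-handle mining function you use, e.g. the TweetMiner method, and the pause length is a starting point to tune, not a guarantee):

```python
import time
import pandas as pd

def mine_all(handles, mine_one, pause_every=25, pause_secs=900):
    """Mine each handle in turn, sleeping periodically so a long run
    stays under the 900-requests-per-15-minutes standard limit."""
    frames = []
    for counter, handle in enumerate(handles, start=1):
        frames.append(mine_one(handle))
        if counter % pause_every == 0:
            time.sleep(pause_secs)  # let the 15-minute window reset
    return pd.concat(frames)

# all_tweets = mine_all(handles, miner.mine_user_tweets)
```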
Finally, note that the index of the all_tweets data frame is repeated for each handle. To make it unique across all tweets, reset the index.
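For example (two toy per-handle frames standing in for an all_tweets frame built by concatenation):

```python
import pandas as pd

# two per-handle frames, each carrying its own 0-based index
a = pd.DataFrame({'handle': ['a', 'a'], 'text': ['t1', 't2']})
b = pd.DataFrame({'handle': ['b'], 'text': ['t3']})

all_tweets = pd.concat([a, b])                   # index repeats: 0, 1, 0
all_tweets = all_tweets.reset_index(drop=True)   # unique index: 0, 1, 2
print(all_tweets.index.tolist())  # [0, 1, 2]
```

drop=True discards the old repeating index instead of keeping it as a new column.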
That’s everything! You’re now ready to collect tweets en masse in a usable format.