Visualisation of Information from Raw Twitter Data — Part 1

For this post I downloaded tweets about Brexit over a period of about two days, querying the API for tweets with Brexit-related hashtags like #Brexit, #brexit, #PeoplesVote or #MarchToLeave.

Open up a Jupyter Notebook and let's start coding. First, as always, we need to import the libraries needed for the analysis and visualisation:

```python
#Import all the needed libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import seaborn as sns
import re
import collections
from wordcloud import WordCloud
```

Then, we will read the data that we have collected and process it.

Remember that the output of the Streaming API is a JSON object for each tweet, with many fields that can provide very useful information.
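As a quick illustration, a heavily trimmed, made-up example of one such JSON line might look like this; only the fields used later in this post are shown, and in real tweets the 'source' value is an HTML anchor tag, shortened here for readability:

```python
import json

# A heavily trimmed, made-up example of one line of Streaming API output;
# real tweet objects contain many more fields
sample_line = '''{
  "created_at": "Fri Mar 22 10:15:00 +0000 2019",
  "text": "Example tweet text about #Brexit",
  "source": "Twitter for iPhone",
  "in_reply_to_screen_name": null,
  "user": {"screen_name": "some_user", "location": "London"}
}'''

tweet = json.loads(sample_line)
print(tweet['user']['screen_name'])   # some_user
print('retweeted_status' in tweet)    # False, so this tweet is not a retweet
```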

In the following block of code, we read these tweets from the .txt file where they are stored:

```python
#Reading the raw data collected from the Twitter Streaming API using
#Tweepy.
tweets_data = []
tweets_data_path = 'Brexit_tweets_1.txt'
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except ValueError:
            continue
```

For this to work, you will have to change the string tweets_data_path to the name of the file where you have stored your data.

I also suggest you create the notebook in the same folder where the downloaded tweets are stored, so that you don't have to worry about relative paths when loading the data.

While downloading the data there might be connection issues or other kinds of errors, so the .txt file where we have stored the tweets may contain some lines that should be tweets but are error codes instead. To get rid of them, run the following code after reading the .txt file:

```python
#Error codes from the Twitter API can be inside the .txt document,
#take them off
tweets_data = [x for x in tweets_data if not isinstance(x, int)]
```

Now, let's see how many tweets we have collected:

```python
print("The total number of Tweets is:", len(tweets_data))
```

In my case, I downloaded 133030 tweets.

Okay, now that we have read the .txt file and have our tweets in JSON format ready in the notebook, we will implement some functions to map some of the parameters of these JSON objects to columns of a Pandas dataframe, where we will store the tweets from now on.

First, we will create a function that allows us to see if the selected tweet is a retweet or not.

This is done by evaluating whether the JSON object includes a field called 'retweeted_status'.

This function returns True if the tweet is a retweet, and False if it's not.

```python
#Create a function to see if the tweet is a retweet
def is_RT(tweet):
    if 'retweeted_status' not in tweet:
        return False
    else:
        return True
```

Now, we will do something similar to evaluate whether the downloaded tweets are replies to some other user's tweets.

Again, this is done by checking for the absence of certain fields within the JSON object.

We will use the ‘in_reply_to_screen_name’ field, so that aside from seeing if the tweet is a response or not, we can see which user the tweet is responding to.

```python
#Create a function to see if the tweet is a reply to a tweet of
#another user, if so return said user.
def is_Reply_to(tweet):
    if 'in_reply_to_screen_name' not in tweet:
        return False
    else:
        return tweet['in_reply_to_screen_name']
```

Lastly, the tweet JSON object includes a 'source' field, which is a kind of identifier of the device or application from which the tweet was posted.

Twitter does not provide extensive guidelines on how to map this source field to actual devices, so I've manually made a list of what I think the source information could map to.

```python
#Create function for taking the most used Tweet sources off the
#source column
def reckondevice(tweet):
    if 'iPhone' in tweet['source'] or ('iOS' in tweet['source']):
        return 'iPhone'
    elif 'Android' in tweet['source']:
        return 'Android'
    elif 'Mobile' in tweet['source'] or ('App' in tweet['source']):
        return 'Mobile device'
    elif 'Mac' in tweet['source']:
        return 'Mac'
    elif 'Windows' in tweet['source']:
        return 'Windows'
    elif 'Bot' in tweet['source']:
        return 'Bot'
    elif 'Web' in tweet['source']:
        return 'Web'
    elif 'Instagram' in tweet['source']:
        return 'Instagram'
    elif 'Blackberry' in tweet['source']:
        return 'Blackberry'
    elif 'iPad' in tweet['source']:
        return 'iPad'
    elif 'Foursquare' in tweet['source']:
        return 'Foursquare'
    else:
        return '-'
```

Okay! After creating all these functions, we are ready to pass our tweets into a dataframe for easier processing.

Notice that some of the columns of the dataframe, extracted from the JSON objects, do not require custom functions, as we just take the raw data from the JSON for these columns.

However, some of the columns of the dataframe do need some further explanation.

The 'text' column of the dataframe is filled either with the normal 'text' field from the JSON object or with the 'extended_tweet' full text. This is done because there are tweets with a text length between 140 and 280 characters, and for them the 'text' field of the JSON object does not hold the entire text.

```python
#Convert the Tweet JSON data to a pandas Dataframe, and take the
#desired fields from the JSON. More could be added if needed.
tweets = pd.DataFrame()
tweets['text'] = list(map(lambda tweet: tweet['text'] if 'extended_tweet' not in tweet else tweet['extended_tweet']['full_text'], tweets_data))
tweets['Username'] = list(map(lambda tweet: tweet['user']['screen_name'], tweets_data))
tweets['Timestamp'] = list(map(lambda tweet: tweet['created_at'], tweets_data))
tweets['length'] = list(map(lambda tweet: len(tweet['text']) if 'extended_tweet' not in tweet else len(tweet['extended_tweet']['full_text']), tweets_data))
tweets['location'] = list(map(lambda tweet: tweet['user']['location'], tweets_data))
tweets['device'] = list(map(reckondevice, tweets_data))
tweets['RT'] = list(map(is_RT, tweets_data))
tweets['Reply'] = list(map(is_Reply_to, tweets_data))
```

The 'Username' column of the dataframe depicts the user who posted the tweet, and all the other columns are pretty much self-explanatory.
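As an optional extra not covered in the original post, note that the 'Timestamp' column still holds raw strings like 'Fri Mar 22 10:15:00 +0000 2019'. If you later want to resample or plot over time, pandas can parse this fixed Streaming API format; a minimal sketch, using stand-in example strings instead of the real dataframe:

```python
import pandas as pd

# Hypothetical example strings in Twitter's created_at format
raw_timestamps = pd.Series(['Fri Mar 22 10:15:00 +0000 2019',
                            'Sat Mar 23 18:30:45 +0000 2019'])

# The Streaming API uses one fixed format, so we can parse it directly;
# on the real data this would be applied to tweets['Timestamp']
parsed = pd.to_datetime(raw_timestamps, format='%a %b %d %H:%M:%S %z %Y')
print(parsed.dt.day.tolist())  # [22, 23]
```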

If we take a look at the head of our dataframe, it should look something like this:

```python
tweets.head()
```

Awesome! Our dataframe is built, and we are ready to explore the data!

3. Data Analysis and Visualisation

First, we will explore the different tweet categories that can be retrieved using the Streaming API.

Let's check out how many tweets are retweets, for example:

```python
#See the percentage of tweets from the initial set that are
#retweets:
RT_tweets = tweets[tweets['RT'] == True]
print(f"The percentage of retweets is {round(len(RT_tweets)/len(tweets)*100)}% of all the tweets")
```

For me, retweets make up 73% of all the tweets.

This gives some interesting information about how Twitter works: most users do not post their own content, but rather forward content from other users.

If we take the RT_tweets dataframe and print its head, we get something like this:

```python
RT_tweets.head()
```

From this dataframe, we can see the structure of the text of the tweets returned by the Streaming API when such tweets are retweets. The format is the following:

"RT @InitialTweetingUser: Tweet Text"

where the RT at the start indicates that the tweet is a retweet, @InitialTweetingUser is the Twitter username of the account that posted the original tweet, and Tweet Text is the text of said initial tweet.
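Purely as an illustrative sketch (this helper is not part of the original post), that format can be parsed with a regular expression to recover the original poster's username:

```python
import re

# Hypothetical helper: extract the original author from a retweet's text,
# which follows the "RT @user: text" format described above
rt_pattern = re.compile(r"^RT @(\w+):")

def original_author(text):
    match = rt_pattern.match(text)
    return match.group(1) if match else None

print(original_author("RT @InitialTweetingUser: Tweet Text"))  # InitialTweetingUser
print(original_author("Just a normal tweet"))                  # None
```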

With the Reply column we have created, we can also see how many of the tweets downloaded from the Streaming API are replies to tweets of another user:

```python
#See the percentage of tweets from the initial set that are replies
#to tweets of another user:
Reply_tweets = tweets[tweets['Reply'].apply(type) == str]
print(f"The percentage of replies is {round(len(Reply_tweets)/len(tweets)*100)}% of all the tweets")
```

For me, the percentage of replies is about 7% of all the tweets.

Again, if we take a look at the Reply_tweets dataframe, we can see the structure of the replies returned by the Streaming API. These replies have the following format:

"@InitialTweetingUser Reply Text"

where @InitialTweetingUser is the user who posted the tweet that is being replied to, and Reply Text is the reply.

Now let's see the percentage of tweets that have mentions but are not retweets.

Note that these tweets include the previous reply tweets.

```python
#See the percentage of tweets from the initial set that have
#mentions and are not retweets:
mention_tweets = tweets[~tweets['text'].str.contains("RT") & tweets['text'].str.contains("@")]
print(f"The percentage of mention tweets is {round(len(mention_tweets)/len(tweets)*100)}% of all the tweets")
```

For me this accounts for 11% of the total tweets.

The tweets with mentions that are not replies or retweets are just tweets that include said mention somewhere in the middle of the text, like:

"'Our working assumption remains that the UK is leaving on the 29th of March', says @EU_Commission spox on #brexit"

Nice use of the verb 'remain'.

Lastly, let's see how many tweets are just plain text tweets, with no mention or retweet:

```python
#See how many tweets inside are plain text tweets (No RT or mention)
plain_text_tweets = tweets[~tweets['text'].str.contains("@") & ~tweets['text'].str.contains("RT")]
print(f"The percentage of plain text tweets is {round(len(plain_text_tweets)/len(tweets)*100)}% of all the tweets")
```

For me this is about 15% of the total tweets.

Let's make a plot of all of these categories to better compare their proportions:

```python
#Now we will plot all the different categories. Note that the reply
#tweets are inside the mention tweets
len_list = [len(tweets), len(RT_tweets), len(mention_tweets), len(Reply_tweets), len(plain_text_tweets)]
item_list = ['All Tweets', 'Retweets', 'Mentions', 'Replies', 'Plain text tweets']
plt.title('Tweet categories', fontsize=20)
plt.xlabel('Type of tweet')
plt.ylabel('Number of tweets')
sns.barplot(x=item_list, y=len_list, edgecolor='black', linewidth=1)
plt.show()
```

Awesome! Let's discover which are the most used hashtags and the most mentioned users:

```python
#To see the most used hashtags.
hashtags = []
hashtag_pattern = re.compile(r"#[a-zA-Z]+")
hashtag_matches = list(tweets['text'].apply(hashtag_pattern.findall))
hashtag_dict = {}
for match in hashtag_matches:
    for singlematch in match:
        if singlematch not in hashtag_dict.keys():
            hashtag_dict[singlematch] = 1
        else:
            hashtag_dict[singlematch] = hashtag_dict[singlematch] + 1
```

For this, we use regular expressions (included in the Python library re) to create a pattern that detects a hashtag inside the text.

Then, we will create a dictionary with all the found hashtags, where the key is the hashtag text and the value is the number of times the hashtag has been posted.
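Incidentally, the collections module imported at the start can do this counting in one step. As a small sketch of the equivalent (the list of per-tweet matches here is a made-up stand-in for the real one):

```python
import collections

# Hypothetical stand-in for the per-tweet regex matches computed above
hashtag_matches = [['#Brexit', '#PeoplesVote'], ['#Brexit'], []]

# Counter builds the same hashtag -> count mapping in one pass
hashtag_counter = collections.Counter(tag for match in hashtag_matches for tag in match)
print(hashtag_counter.most_common(2))  # [('#Brexit', 2), ('#PeoplesVote', 1)]
```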

```python
#Making a list of the most used hashtags and their values
hashtag_ordered_list = sorted(hashtag_dict.items(), key=lambda x: x[1])
hashtag_ordered_list = hashtag_ordered_list[::-1]
#Separating the hashtags and their values into two different lists
hashtag_ordered_values = []
hashtag_ordered_keys = []
#Pick the 20 most used hashtags to plot
for item in hashtag_ordered_list[0:20]:
    hashtag_ordered_keys.append(item[0])
    hashtag_ordered_values.append(item[1])
```

After this, we sort the dictionary by value and separate the values and the hashtags into two different lists.

By doing this, we can now plot the 20 most used hashtags, along with the number of times they appear:

```python
#Plotting a graph with the most used hashtags
fig, ax = plt.subplots(figsize=(12, 12))
y_pos = np.arange(len(hashtag_ordered_keys))
ax.barh(y_pos, list(hashtag_ordered_values)[::-1], align='center', color='green', edgecolor='black', linewidth=1)
ax.set_yticks(y_pos)
ax.set_yticklabels(list(hashtag_ordered_keys)[::-1])
ax.set_xlabel("Nº of appearances")
ax.set_title("Most used #hashtags", fontsize=20)
plt.show()
```

From this figure it can be clearly seen that #Brexit is the most used hashtag, which is pretty obvious, as it is one of the hashtags that was used to download the tweets for the Brexit topic.

Another nice visual representation that can be made with this information is a word cloud, using the WordCloud Python library. Word clouds are collages of different words that have a corresponding numerical value (like the number of appearances) and are scaled according to this value: the words with the highest value will be the biggest ones in the collage.

```python
#Make a wordcloud plot of the most used hashtags, for this we need a
#dictionary where the keys are the words and the values are the
#number of appearances
hashtag_ordered_dict = {}
for item in hashtag_ordered_list[0:20]:
    hashtag_ordered_dict[item[0]] = item[1]
wordcloud = WordCloud(width=1000, height=1000, random_state=21, max_font_size=200, background_color='white').generate_from_frequencies(hashtag_ordered_dict)
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.show()
```

Wordcloud representation of the most used hashtags

Looks cool, right? Let's do the same with the mentions now.

```python
#Now we will do the same with the mentions:
mentions = []
mention_pattern = re.compile(r"@[a-zA-Z_]+")
mention_matches = list(tweets['text'].apply(mention_pattern.findall))
mentions_dict = {}
for match in mention_matches:
    for singlematch in match:
        if singlematch not in mentions_dict.keys():
            mentions_dict[singlematch] = 1
        else:
            mentions_dict[singlematch] = mentions_dict[singlematch] + 1
```

Again, we use regular expressions to build the pattern for a mention and make a dictionary where the keys are the mentioned users and the values are the number of times they are mentioned.

Take into account that this mention pattern will also pick up mentions from the retweets and replies, so this dictionary will include not only users who have been explicitly mentioned, but also users whose posts have been retweeted or replied to.
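If you wanted only explicit mentions, one possible variation (my own sketch, not part of the original post) is to run the same counting over the non-retweet subset of the dataframe. In miniature, with a made-up two-row dataframe standing in for the real one:

```python
import re
import pandas as pd

# Hypothetical miniature dataframe standing in for the real tweets dataframe
tweets_demo = pd.DataFrame({
    'text': ['RT @theresa_may: statement', '@jeremycorbyn what about this?'],
    'RT': [True, False],
})

mention_pattern = re.compile(r"@[a-zA-Z_]+")
# Count mentions only in tweets that are not retweets
non_rt_texts = tweets_demo[tweets_demo['RT'] == False]['text']
mentions = [m for text in non_rt_texts for m in mention_pattern.findall(text)]
print(mentions)  # ['@jeremycorbyn']
```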

```python
#Create an ordered list of tuples with the most mentioned users and
#the number of times they have been mentioned
mentions_ordered_list = sorted(mentions_dict.items(), key=lambda x: x[1])
mentions_ordered_list = mentions_ordered_list[::-1]
#Pick the 20 top mentioned users to plot and separate the previous
#list into two lists: one with the users and one with the values
mentions_ordered_values = []
mentions_ordered_keys = []
for item in mentions_ordered_list[0:20]:
    mentions_ordered_keys.append(item[0])
    mentions_ordered_values.append(item[1])
```

Now, if we plot these results:

```python
fig, ax = plt.subplots(figsize=(12, 12))
y_pos = np.arange(len(mentions_ordered_keys))
ax.barh(y_pos, list(mentions_ordered_values)[::-1], align='center', color='yellow', edgecolor='black', linewidth=1)
ax.set_yticks(y_pos)
ax.set_yticklabels(list(mentions_ordered_keys)[::-1])
ax.set_xlabel("Nº of mentions")
ax.set_title("Most mentioned accounts", fontsize=20)
plt.show()
```

From this we can see that the most mentioned account is @theresa_may, the official account of Theresa May.

We can also see other political personalities in this chart, like @jeremycorbyn (Jeremy Corbyn) or @Anna_Soubry (Anna Soubry), accounts belonging to political parties (@UKLabour, @Conservatives), news sources (@BBCNews, @BBCPolitics, @SkyNews), and different journalists (@georgegalloway). This insight could be of great use, and in future posts we will explore how to create networks using the different Twitter interactions, and see the role of these most mentioned users in those networks.

As for the hashtags, let's also make a WordCloud representation:

```python
#Make a wordcloud representation for the most mentioned accounts too
mentions_ordered_dict = {}
for item in mentions_ordered_list[0:20]:
    mentions_ordered_dict[item[0]] = item[1]
wordcloud = WordCloud(width=1000, height=1000, random_state=21, max_font_size=200, background_color='white').generate_from_frequencies(mentions_ordered_dict)
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.show()
```

Wordcloud representation of the most mentioned users

Conclusion

We have explored some interesting visualisations that can be obtained from raw Twitter data, without any kind of complex algorithms, and also studied the format of the responses from Twitter's Streaming API.

In the next post we will continue with some other cool visualisation techniques: seeing which users post the most tweets and the chance that they are bots, creating a time series of the tweet publications, checking out the devices the tweets are produced from, and getting some further insights.

Feel free to follow me on Twitter: @jaimezorno, or contact me on LinkedIn.

Thanks for reading, have a good day, and see you soon!
