Determining the Happiest Cities using Twitter Sentiment Analysis with BERT

In this article, we apply the current state of the art in Deep Learning to determine the positivity of users' tweets in various cities around the globe.

Using Fast-Bert, a simple and minimalistic library wrapped around Hugging Face's PyTorch implementation of Google's BERT model (originally released in TensorFlow), we performed binary classification on a dataset of tweets to rate each one as either positive or negative.

We then sorted the cities by their ratio of positive to negative tweets.

BERT is a state-of-the-art language model built on the Transformer architecture; its key innovation is bidirectional training instead of the standard one-directional training. You can read more about it in these articles:

BERT Explained: State of the art language model for NLP (towardsdatascience.com)

Introducing FastBert — A simple Deep Learning library for BERT Models (medium.com)

Our Plan

We begin by collecting a dataset of tweets to train BERT on.

The way Fast-Bert and BERT work is that BERT is already pre-trained on a large corpus of text, which means it performs well out of the box on many Natural Language Processing tasks, such as sentiment analysis (our objective), general text classification, question answering, and named entity recognition.

All we must do is fine-tune the model (i.e. train it a little more) on the text we wish to use in its application.

In our case, we will train it on a dataset of Tweets, with the tweet text as our feature, and its sentiment category (positive or negative) as its label.

After training and testing our model, we shall collect recent tweets of users in various large cities around the world, and use our trained model to predict the sentiment of the tweets in order to rate the cities by tweet positivity.

Tools Used

- Python 3
- Fast-Bert: an excellent, simple wrapper for the PyTorch BERT module, very fast-ai inspired. It makes data preparation and training writeable in under 3 lines of code.
- PyTorch: required for Fast-Bert to train the model.
- apex: a PyTorch extension for distributed training, used by Fast-Bert.
- A Linux system: required, as the PyTorch distributed training procedure implemented in Fast-Bert only works on Linux.
- Tweepy: a clean wrapper for the Twitter API.
- A Twitter API developer account: required, as we use Tweepy to pull tweets.
- A dataset of Tweets to fine-tune the BERT model. The following dataset was used, cut down to just text and label: https://www.kaggle.com/kazanova/sentiment140
- A Paperspace virtual machine: if you do not have access to an Nvidia CUDA-enabled GPU, a virtual machine is very inexpensive.

Data Processing

The dataset downloaded from Kaggle contains six columns: sentiment polarity, tweet id, date, query flag, user, and tweet text. We must correct the formatting of this data, since the Fast-Bert object responsible for storing it, the DataBunch, requires the format [text, label].

We run these processing steps with the Pandas library; a sketch of them follows below. Now, we shuffle and split the dataset into Training and Validation sets.

A Test set is recommended, but optional.

We will split the data 70/30 into Training and Validation, and save them back into .csv format.
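Here is a minimal sketch of these steps, assuming the raw Sentiment140 column layout and hypothetical file names and paths:

```python
import pandas as pd

# The raw Sentiment140 CSV ships without a header row; its columns are
# polarity (0 = negative, 4 = positive), tweet id, date, query flag, user, text.
cols = ["polarity", "id", "date", "query", "user", "text"]
df = pd.read_csv("sentiment140.csv", names=cols, encoding="latin-1")  # hypothetical file name

# Keep only the two columns the DataBunch needs, in [text, label] order.
df = df[["text", "polarity"]].rename(columns={"polarity": "label"})
df["label"] = df["label"].map({0: "negative", 4: "positive"})

# Shuffle, then split 70/30 into Training and Validation sets.
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
split = int(len(df) * 0.7)
df.iloc[:split].to_csv("data/train.csv", index=False)
df.iloc[split:].to_csv("data/val.csv", index=False)

# Fast-Bert also reads a small file listing the label names, one per line.
pd.Series(["negative", "positive"]).to_csv("data/labels.csv", index=False, header=False)
```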

Training BERT, the beast.

We are getting closer to the good stuff now.

Data Processing is always tedious, but very necessary, and can certainly take a long time.

Luckily, training a model with Fast-Bert is very easy.

We begin with the preliminary steps of setting up the data and device. Training with Fast-Bert is very simple: all we do is provide the metric calculation strategy, create a BertLearner object, and call .fit().
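A minimal training sketch along these lines, following the Fast-Bert examples (the paths, batch size, learning rate, and epoch count here are assumptions rather than the exact values used):

```python
import logging
import torch
from fast_bert.data_cls import BertDataBunch
from fast_bert.learner_cls import BertLearner
from fast_bert.metrics import accuracy

logger = logging.getLogger()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Bundle the train/validation CSVs into a DataBunch.
databunch = BertDataBunch(
    "data/", "data/",                      # data dir and label dir (hypothetical paths)
    tokenizer="bert-base-uncased",
    train_file="train.csv", val_file="val.csv", label_file="labels.csv",
    text_col="text", label_col="label",
    batch_size_per_gpu=16, max_seq_length=128,
    multi_gpu=False, multi_label=False, model_type="bert",
)

# Metric calculation strategy: plain accuracy on the validation set.
metrics = [{"name": "accuracy", "function": accuracy}]

# Create a BertLearner from a pretrained checkpoint and fine-tune it.
learner = BertLearner.from_pretrained_model(
    databunch, pretrained_path="bert-base-uncased",
    metrics=metrics, device=device, logger=logger,
    output_dir="output/", multi_gpu=False, is_fp16=False, multi_label=False,
)
learner.fit(epochs=3, lr=6e-5, schedule_type="warmup_cosine")
learner.save_model()  # write the fine-tuned model to output_dir for later prediction
```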

Preparing for Prediction

Our model trained to 94% accuracy on the validation set.

Testing your model on a dataset similar to the Tweets you will be predicting on is also highly recommended, but this step was skipped for simplicity's sake.

Since we saved our model to a file rather than keeping it in memory, we need to use the following methods to make predictions from the saved model.
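A sketch of loading the saved model through Fast-Bert's predictor class (the paths below are hypothetical and should point at wherever learner.save_model() wrote the model):

```python
from fast_bert.prediction import BertClassificationPredictor

# Load the fine-tuned model saved by learner.save_model().
predictor = BertClassificationPredictor(
    model_path="output/model_out",  # hypothetical saved-model directory
    label_path="data/",             # directory containing labels.csv
    multi_label=False,
    model_type="bert",
    do_lower_case=True,
)

# A single prediction returns a list of (label, probability) pairs.
print(predictor.predict("What a beautiful morning in the city!"))

# Batch prediction is far faster for long lists of tweets.
predictions = predictor.predict_batch(["I love this place", "Worst commute ever"])
```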

Choosing Cities and Coordinate Collection

Since it would be unreasonable to collect data from every possible location across the world, the following large, well-known, and populated cities were chosen: Toronto, London, Moscow, Montreal, Ottawa, Vancouver, Hong Kong, Bangkok, Paris, New York, Kuala Lumpur, Istanbul, Dubai, Seoul, Rome, Taipei, Miami, Prague, Shanghai, Las Vegas, Milan, Barcelona, Amsterdam, Vienna, Venice, Los Angeles, Lima, Tokyo, Johannesburg, Beijing, Orlando, Berlin, Budapest, Florence, Warsaw, Delhi, Mexico City, Dublin, San Francisco, Saint Petersburg, Brussels, Sydney, Lisbon, and Cairo.

When we search for Tweets in a specific location with the Twitter API, we must provide three parameters: latitude, longitude, and the radius to search within.

To collect these three parameters for each city, we determined a 'bounding box' around each city using the bounding box tool at https://boundingbox.klokantech.com/, then computed the search radius with the Pythagorean theorem. I have included them here.
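The coordinate table is not reproduced here, but the radius computation itself is simple. Below is a minimal sketch, assuming each bounding box is given as (min_lat, min_lon, max_lat, max_lon) and using a rough degrees-to-kilometres conversion:

```python
import math

def city_search_params(min_lat, min_lon, max_lat, max_lon):
    """Turn a bounding box into the (lat, lon, radius_km) triple the Twitter search needs."""
    center_lat = (min_lat + max_lat) / 2
    center_lon = (min_lon + max_lon) / 2

    # Approximate conversion from degrees to kilometres (fine for city-sized boxes).
    km_per_deg_lat = 111.0
    km_per_deg_lon = 111.0 * math.cos(math.radians(center_lat))

    half_height_km = (max_lat - min_lat) / 2 * km_per_deg_lat
    half_width_km = (max_lon - min_lon) / 2 * km_per_deg_lon

    # Pythagorean theorem: the radius that reaches the corners of the box.
    radius_km = math.sqrt(half_height_km ** 2 + half_width_km ** 2)
    return center_lat, center_lon, radius_km

# Example with a rough (hypothetical) bounding box around Toronto.
lat, lon, radius = city_search_params(43.58, -79.64, 43.86, -79.12)
```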

Note: the coordinates may be off-center by a mean distance of 5 km.

Using Tweepy to pull Tweets from Cities

Using Tweepy, with a prebuilt TweetHandler class (which can be found at https://gist.github.com/vdyagilev/66190688710a8e11aef57645251e84b0), we pulled tweets from the chosen cities.

We pulled 1000 Tweets from each city.
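The TweetHandler class lives in the gist linked above; as a rough illustration only, a bare-bones equivalent using Tweepy 3.x might look like this (the credentials, wildcard query, and cities dictionary are assumptions):

```python
import time
import tweepy

# Authenticate with the Twitter API (credentials are placeholders).
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def pull_city_tweets(lat, lon, radius_km, n=1000):
    """Collect up to n recent tweets posted within radius_km of (lat, lon)."""
    geocode = f"{lat},{lon},{radius_km}km"
    # q="*" is a placeholder query; the geocode restricts results to the city.
    cursor = tweepy.Cursor(api.search, q="*", geocode=geocode, tweet_mode="extended")
    return [status.full_text for status in cursor.items(n)]

city_tweets = {}
for name, (lat, lon, radius) in cities.items():  # `cities` built from the coordinates above
    city_tweets[name] = pull_city_tweets(lat, lon, radius)
    time.sleep(15 * 60)  # sleep 15 minutes per city to respect the rate limit (see note below)
```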

Then, we used another prebuilt class (which can be found at https://gist.github.com/vdyagilev/14ff1ce525383f46b014aaead996fa1f) to translate the non-English tweet text into English.

This must be done since our BERT model was trained on English data, and this translation will not impact our sentiment analysis since no inherent meaning is lost.
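The translation class is in the gist above; purely as an illustration, a similar step could be sketched with the googletrans package (an alternative, not the class actually used):

```python
from googletrans import Translator

translator = Translator()

def to_english(texts):
    """Translate tweets to English; already-English text passes through unchanged."""
    translated = []
    for text in texts:
        result = translator.translate(text, dest="en")
        translated.append(result.text)
    return translated

city_tweets = {city: to_english(tweets) for city, tweets in city_tweets.items()}
```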

Note: we must ensure our loop sleeps for 15 minutes each iteration, or else we shall get a Twitter Error 429 for sending too many requests.

Performing Sentiment Analysis on our Tweet Data

Having collected a dataset of Tweets, we can finally run our predictor on them. We loop through each city and its corresponding Tweets, running the predictor on each Tweet, counting the results, and computing the ratio of positive to negative tweets for each city.
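A sketch of this loop, reusing the predictor from earlier (the label names and the city_tweets dictionary follow the assumptions in the sketches above):

```python
positivity = {}
for city, tweets in city_tweets.items():
    # predict_batch returns, for each tweet, a list of (label, probability) pairs.
    results = predictor.predict_batch(tweets)
    top_labels = [max(preds, key=lambda p: p[1])[0] for preds in results]

    positive = top_labels.count("positive")
    negative = top_labels.count("negative")
    positivity[city] = positive / max(negative, 1)  # guard against division by zero

# Sort cities from happiest to saddest by their positive-to-negative ratio.
for city, ratio in sorted(positivity.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{city}: {ratio:.2f}")
```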

The finale

Here are the startling results.

As we can see, the happiest city by Twitter sentiment analysis appears to be Seoul, with an average of 3.58 positive tweets for every negative tweet. The saddest city seems to be Istanbul, with an average of only 1.17 positive tweets for every negative tweet. Shockingly, the mean ratio of tweet positivity is 2.0, meaning users send two positive tweets for each negative one.

I expected this number to be much higher: remember, the Tweet dataset we trained our model on had only two labels, positive and negative, with no neutral label, so many texts that would otherwise be considered neutral were labeled as positive.

However, after a quick analysis, we can probably conclude that we use Twitter most often when we are venting: we send hostile tweets frequently, but this is not a reflection of our true moods.

Ultimately, we have witnessed the power of BERT and the simplicity of Fast-Bert.

Only recently, NLP was far less capable and not accessible to the general public.

Now, we have the ability to perform such powerful data analysis with tools available to the public.

If you’re interested in learning Deep Learning with Neural Networks, I highly suggest beginning with the fast-ai course taught by Jeremy Howard.

It teaches great engineering principles and promotes a project-based learning pathway in which you dive in head first and learn the nitty-gritty technical details as questions arise.
