Making computers understand the sentiment of tweets
Kristoffer Stensbo-Smidt
Jan 11

Understanding whether a tweet is meant as positive or negative is something humans rarely have problems with.
For computers, however, it is an entirely different story: complicated sentence structure, sarcasm, figurative language, etc. make it difficult for computers to judge the meaning and sentiment of a sentence.
However, automatically assessing the sentiment of a tweet would allow for large-scale opinion-mining of the population on all sorts of issues and could help us understand why certain groups of the population hold certain opinions.
On a more fundamental level, understanding the sentiment of text is a key part of natural language understanding and thus an essential task to solve if we want computers to be able to communicate efficiently with us.
In this blog post, I will present the results of a small research project carried out as part of the SoBigData project at the University of Sheffield.
We tested different approaches to processing text and analysed how much of the sentiment they are able to pick up.
Read on for a full tour of the project and the results!

Introduction

The aim of the project was to test how well computers can understand the sentiment of text using machine learning.
To do this, we fed the computer with lots of tweets that had each been labelled as having either positive, neutral, or negative sentiment by humans.
Each tweet also had an associated topic, which is important to make use of since a sentence can have very different sentiment depending on the topic discussed.
For instance, the word “high” is positive if we are talking about quality, but negative if we are talking about prices.
“Green” is positive when discussing environmental issues, but may be neutral when discussing art.
The task for the computer is now to predict the sentiment given a tweet and an associated topic.
How do computers read text?

If you do not have experience with machine learning, this might seem like an odd question.
But machine learning is based on statistics, so anything a machine learning system is to work with has to be represented as numbers.
Turning text into numbers happens with so-called embedding models, and it is a major research field in itself to develop these.
An embedding model turns a word or a sentence into a vector, which is continuously adjusted during training such that words and sentences with similar meanings end up with similar vectors.
Ideally then, the vector should capture the meaning, context, sentiment, etc. of a sentence, but this is not an easy task at all, which is why many different embedding models have been developed.
Generally, newer models perform better, but they may also be tuned to specific tasks.
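To make "similar vectors" concrete, here is a minimal sketch using cosine similarity, one common way of comparing embeddings. The three-dimensional vectors are made up purely for illustration; real embedding models produce vectors with hundreds of dimensions, and the post does not say which similarity measure, if any, was inspected.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy "embeddings", for illustration only.
toy_embeddings = {
    "great": [0.9, 0.8, 0.1],
    "excellent": [0.8, 0.9, 0.2],
    "terrible": [-0.7, -0.9, 0.1],
}

sim_close = cosine_similarity(toy_embeddings["great"], toy_embeddings["excellent"])
sim_far = cosine_similarity(toy_embeddings["great"], toy_embeddings["terrible"])
```

With these toy numbers, the two words with similar meaning point in nearly the same direction, while the word with opposite meaning points away from them.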
Full-blown machine learning systems capable of achieving state-of-the-art results on, say, sentiment analysis are beasts.
They consist of multiple components, of which the text embeddings are only one, and it is generally very difficult to assess which parts of the systems are the performance bottlenecks.
Since any text needs to be represented as a vector for a machine learning system to be able to work with it, any analysis, including predicting the sentiment of a tweet, relies heavily on the chosen embedding model.
But that is not to say that other parts of the system cannot be equally important.
To make the role and contributions of text embeddings more transparent, we set out to test their performance for predicting sentiments with a system designed to be minimally obscuring.
How do we predict the sentiment?

Our approach for predicting the sentiment is fairly simple and inspired by collaborative filtering.
Each tweet has an associated topic and it is essential that the sentiment is evaluated with respect to the topic (since a statement can easily be positive towards one aspect and negative towards another).
As both the tweet and the corresponding topic are represented by vectors with the same dimensionality, we can take the inner product of the two, giving us a single number representing the sentiment.
There is no reason that this should work with “raw” embeddings, so before taking the inner product, we learn and apply a transformation (further details later) to the topic vector space.
In this way, we can get the sentiment even when the topic has not been seen before.
We want to be able to predict three different kinds of sentiment (positive, neutral, negative), so we actually learn three different transformations of the topic space: one to predict positive sentiment, one to predict neutral sentiment, and one to predict negative sentiment.
When taking the inner product of the tweet with each of the three transformed topic vectors, we get three numbers which can be understood as the model’s bet on each of the sentiments — the higher the number, the more the model believes that this is the sentiment of the tweet.
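The scoring step can be sketched in a few lines of NumPy. This is an illustration under assumptions rather than the project's actual code: the function names are hypothetical, and the three "learned" transformations are replaced by random linear maps just to make the example runnable.

```python
import numpy as np

SENTIMENTS = ("positive", "neutral", "negative")

def sentiment_scores(tweet_vec, topic_vec, transforms):
    """One score per sentiment: inner product of the tweet vector with
    the topic vector after the sentiment-specific transformation."""
    return np.array([tweet_vec @ f(topic_vec) for f in transforms])

def predict(tweet_vec, topic_vec, transforms):
    """Pick the sentiment with the highest score."""
    scores = sentiment_scores(tweet_vec, topic_vec, transforms)
    return SENTIMENTS[int(np.argmax(scores))]

def make_linear(W):
    """Hypothetical stand-in for a learned transformation."""
    return lambda topic: W @ topic

rng = np.random.default_rng(0)
dim = 4  # toy dimensionality; real embeddings are much larger
transforms = [make_linear(rng.normal(size=(dim, dim))) for _ in SENTIMENTS]

tweet_vec = rng.normal(size=dim)
topic_vec = rng.normal(size=dim)
scores = sentiment_scores(tweet_vec, topic_vec, transforms)
label = predict(tweet_vec, topic_vec, transforms)
```

Because the transformations act on the topic vector itself rather than on a per-topic lookup table, the same code applies to a topic that never appeared during training.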
Summary of project

We want to test how much information different word embeddings carry for the sentiment of a tweet.
To predict the sentiment, we train a model that learns three transformations of the topic vector such that the inner product of the tweet and each of the three topic vectors will be the model’s vote for each of the three sentiments.
We have a few different choices to make.
Firstly, we have to choose which embedding models to test.
Secondly, we need to decide on how to transform the topic vectors.
Thirdly, we need a dataset of tweets that have been labelled with sentiment by humans, such that we have something to train and test the model on.
Deciding on the set-up

The dataset

We used the English dataset provided for SemEval-2017 Task 4.
This consists of about 26k tweets with various topics, all manually labelled with sentiment.
We keep the split defined by the task organisers, which is about 20k tweets for training and 6k tweets for testing on.
The embedding models

We chose to test the following four embedding models:
- Neural-Net Language Models (NNLM) from 2003, one of the earliest attempts at learning word embeddings with neural networks. The model constructs 128-dimensional word vectors and will function as a kind of word embedding baseline, which the more advanced models should clearly beat.
- Neural-Net Language Models as above, but now with normalised word vectors, which have sometimes been observed to produce better results.
- Embeddings from Language Models (ELMo) from early 2018, which has been shown to achieve state-of-the-art results in many different tasks. Constructs 1024-dimensional word vectors.
- Universal Sentence Encoder (USE) from early 2018, a model trained to produce sentence embeddings useful across many tasks. Constructs 512-dimensional vectors.
All four embedding models are conveniently available from TensorFlow Hub.
The transformation models

Choosing a model for transforming the topic vector space is tricky.
On one hand, we would like to keep the original vector space as unchanged as possible.
On the other hand, we would like the transformation to be flexible enough that the information in the word embeddings can actually be used to predict the sentiment.
We therefore decided to test two different transformation models:
- A simple, affine transformation. Such a transformation can only represent the most basic transformations like scaling, rotation, shear, and translation so, in some sense, this will test how much information the “raw” embeddings have captured.
- A more complex transformation, represented by a neural network. We use a neural network with two hidden layers, each 8 times the embedding dimensionality, ReLU activation functions and dropout. The network takes as input the topic vector and outputs the transformed topic vector. Such a transformation can warp the topic space in a highly nonlinear way, and it should therefore be able to obtain higher accuracies. However, it will be more difficult to train and may be more prone to overfitting to the training set.
The final model will learn three transformations of each of the above types, corresponding to the three sentiments we want to predict.
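The two transformation types could look roughly like this in NumPy. Only the forward pass is shown. The layer sizes follow the description above, but the initialisation scheme, the dropout rate, and all function names are assumptions made for the sake of a runnable sketch.

```python
import numpy as np

def affine(topic, W, b):
    """Affine transformation: a linear map followed by a translation."""
    return W @ topic + b

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(dim, width_factor=8, seed=0):
    """Parameters for dim -> 8*dim -> 8*dim -> dim (assumed initialisation)."""
    rng = np.random.default_rng(seed)
    hidden = width_factor * dim
    sizes = [dim, hidden, hidden, dim]
    return [(rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def mlp(topic, params, drop_rate=0.0, rng=None):
    """Two ReLU hidden layers; inverted dropout is applied when drop_rate > 0."""
    if rng is None:
        rng = np.random.default_rng()
    h = topic
    for W, b in params[:-1]:
        h = relu(W @ h + b)
        if drop_rate > 0.0:
            keep = rng.random(h.shape) >= drop_rate
            h = h * keep / (1.0 - drop_rate)
    W_out, b_out = params[-1]
    return W_out @ h + b_out

dim = 16  # toy dimensionality
topic = np.ones(dim)
unchanged = affine(topic, np.eye(dim), np.zeros(dim))  # identity map, zero shift
params = init_mlp(dim)
transformed = mlp(topic, params)                        # evaluation mode
transformed_train = mlp(topic, params, drop_rate=0.5,   # training mode
                        rng=np.random.default_rng(1))
```

In both cases the output has the same dimensionality as the input, so it can be used directly in the inner product with the tweet vector.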
Correcting for imbalances in the dataset

It is always challenging to work with real data.
In particular, if a single sentiment or topic is hugely overrepresented, the model might focus on this entirely during training, which will make predictions for other sentiments or other topics way off.
Instead, we want to make sure that the model gives equal weight to all topics and sentiments, regardless of how frequent they are.
The effect of making these corrections is rather dramatic and a good lesson to keep in mind, so let’s spend a few minutes on this.

Imbalances in the dataset

Plotting the number of tweets per sentiment in the datasets shows large class imbalances.
Distribution of sentiment classes for both the training and the test set.
Positive sentiment in particular is heavily overrepresented in the training data: in fact, almost 73% of the training tweets have positive sentiment.
This means that the model will benefit much more from learning to predict positive sentiment than any other.
Neutral sentiment, on the other hand, is associated with less than 10% of the tweets, and the model may simply learn to ignore this sentiment if it helps with predicting positive sentiment.
The distributions in the test set are strikingly different.
Negative sentiment is more abundant than positive, and no tweets have neutral sentiment.
This makes it even more important to make the model treat all sentiments equally.
Indeed, a test with the affine transformation model on NNLM shows that the trained model clearly favours positive sentiment due to its prevalence in the training data.
In this test, the topics in the training data were split into a training and an evaluation set of 90% and 10% of the topics, respectively.
A confusion matrix showing the actual sentiment of the tweets versus what the model predicted.
The percentages show how often a specific, actual sentiment was predicted to be any of the three sentiments by the model.
A perfect model would have 100% along the diagonal, meaning that the predictions are always correct.
Here, however, it is seen that the model often chooses to predict positive sentiment, regardless of what the actual sentiment is.
The figure shows a confusion matrix for the sentiment predictions with each column corresponding to a predicted sentiment.
Each row shows the actual sentiment, and for each of these rows, the number and colour of each matrix element shows the percentage of tweets with this actual sentiment that were predicted to have the sentiment shown in the columns.
Ideally, the diagonal should be close to 100%, meaning that the predicted sentiment was correct for almost all tweets, but even for the training set there are large off-diagonal elements.
This means that even on the training set, where the model has seen the correct sentiment, it frequently defaults to predicting positive sentiment.
43% of tweets with negative sentiment and more than 55% of tweets with neutral sentiment are predicted to have positive sentiment.
For the evaluation set the corresponding figures are 39% and 78%, so the neutral class fares even worse there.
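For readers who want to reproduce this kind of figure, a row-normalised confusion matrix can be computed as in the sketch below. The function name and the six example labels are made up for illustration and are not the project's data.

```python
from collections import Counter

SENTIMENTS = ("positive", "neutral", "negative")

def confusion_percentages(actual, predicted, labels=SENTIMENTS):
    """For each actual sentiment (row), the percentage of its tweets that
    were predicted as each sentiment (column). Rows sum to 100."""
    counts = Counter(zip(actual, predicted))
    matrix = {}
    for a in labels:
        row_total = sum(counts[(a, p)] for p in labels)
        matrix[a] = {
            p: 100.0 * counts[(a, p)] / row_total if row_total else 0.0
            for p in labels
        }
    return matrix

# Made-up labels: a model that leans towards "positive".
actual = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
predicted = ["positive", "positive", "positive", "negative", "positive", "positive"]
cm = confusion_percentages(actual, predicted)
```

In this toy example, half of the negative tweets and all of the neutral tweets end up in the "positive" column, which is the same pattern as described above.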
The number of tweets per topic also varies a lot in both the training and the test sets.
Number of tweets associated with each topic for both the training and the test set.
The topics have been sorted from left to right based on the number of associated tweets, and their names have been omitted for clarity.
For the training set in particular, we see a stark difference in the number of tweets per topic: some topics have more than 100 tweets, while roughly half the topics have about 20 tweets or fewer.
Going back to the test with the affine model and looking at the average accuracy of sentiment prediction for tweets with a given topic shows that topics with more tweets generally have higher accuracy.
The average accuracy of sentiment prediction for tweets in a given topic.
There is a clear tendency in that topics with more associated tweets generally achieve a higher average accuracy.
This tendency makes sense: the model benefits more from learning a transformation that works well for topics with more tweets.
But this is actually not what we want, because it means that the model may not generalise well.
We want the model to perform well on even unseen topics, and overfitting to a few topics will probably not help in this regard.
One way to deal with class imbalances like these is to weigh the penalty the model gets for a wrong prediction by the inverse of the frequency of the class.
This means that the model receives a larger error for less frequent data, thus paying more attention to these.
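As a small illustration of inverse-frequency weighting, the sketch below assigns each example a weight proportional to one over the frequency of its class. The function name and the toy labels are hypothetical, and the exact normalisation used in the project is not stated in this post.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each example by 1 / (relative frequency of its class)."""
    counts = Counter(labels)
    n = len(labels)
    return [n / counts[label] for label in labels]

# Made-up labels with a strong majority class.
labels = ["positive"] * 7 + ["negative"] * 2 + ["neutral"]
weights = inverse_frequency_weights(labels)

# Total weight carried by each class after reweighting.
total_per_class = Counter()
for label, weight in zip(labels, weights):
    total_per_class[label] += weight
```

After reweighting, every class carries the same total weight, so a mistake on the single neutral example costs as much as mistakes on all seven positive examples combined.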
Let’s see how this affects the training of the model.
Correcting for sentiment imbalance

Retraining the model and penalising mistakes with the inverse of the sentiment frequency only, we already obtain a much better model.
Confusion matrices for the affine model on NNLM, correcting for sentiment imbalances in the training set.
For the training set, the diagonal is close to 100% for all sentiments.
The predictions on the evaluation set also improved, though there is plenty of room for improvement.
We also see an improvement in the accuracy per topic for the training set, even though this was not explicitly encouraged.
Average topic accuracy for the affine model on NNLM, correcting for sentiment imbalances in the training set.
Interestingly, the performance on the evaluation set appears to have decreased.
One explanation could be that most of the tweets in the evaluation set have positive sentiment, for which the model has now sacrificed some accuracy to perform better for the negative and neutral sentiments.
Correcting for topic imbalance

Next, let’s see what happens when penalising mistakes with the inverse of topic frequencies only.
This, too, results in much better sentiment predictions on the training set, which might be because weighing topics equally regardless of the number of tweets associated with them exposes the model to a larger variety of sentiments.
Confusion matrices for the affine model on NNLM, correcting for topic imbalances in the training set.
But the real effect is seen when looking at the accuracy per topic.
For the training set, the accuracy is now pretty much independent of the number of tweets in a topic, with most topics being close to 1.
Average topic accuracy for the affine model on NNLM, correcting for topic imbalances in the training set.

Correcting for both sentiment and topic imbalance

The final model will weigh the penalty for a wrong prediction based on frequencies of both the sentiment and the topic.
This is done by simply multiplying the inverses of the topic frequency and the sentiment frequency and using the resulting quantity as the weight.
This should encourage the model to treat all sentiments and all topics equally during training.
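A minimal sketch of the combined weighting is given below: each tweet's weight is the product of the inverse sentiment frequency and the inverse topic frequency, and the per-tweet losses are averaged with these weights. The function names, the toy data, and the choice to normalise by the sum of the weights are assumptions for illustration.

```python
from collections import Counter

def combined_weights(sentiments, topics):
    """Per-tweet weight: (1 / sentiment frequency) * (1 / topic frequency)."""
    sentiment_counts = Counter(sentiments)
    topic_counts = Counter(topics)
    n = len(sentiments)
    return [
        (n / sentiment_counts[s]) * (n / topic_counts[t])
        for s, t in zip(sentiments, topics)
    ]

def weighted_mean_loss(losses, weights):
    """Average of per-tweet losses, weighted and normalised by total weight."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

# Made-up example: three positive tweets, one negative, two topics.
sentiments = ["positive", "positive", "positive", "negative"]
topics = ["a", "a", "b", "b"]
weights = combined_weights(sentiments, topics)

# A model that is only wrong on the rare negative tweet.
loss = weighted_mean_loss([0.0, 0.0, 0.0, 1.0], weights)
```

Without weighting, the single mistake would account for a quarter of the average loss; with the combined weights it accounts for half, so the model has a much stronger incentive to fix it.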
The resulting model does indeed seem to be a good trade-off between accounting for sentiment and topic imbalance.
The sentiments are predicted fairly accurately, and the performance on the evaluation set hasn’t suffered.
Confusion matrices for the affine model on NNLM, correcting for both imbalances in the training set.
The average accuracy per topic is again independent of the number of tweets associated with the topic.
Average topic accuracy for the affine model on NNLM, correcting for both imbalances in the training set.
While correcting for class imbalances clearly helped on the training set, the performance on the evaluation set still did not change noticeably.
The model does not seem to be able to generalise well to new topics, which could mean that the affine transformation is too restrictive or that the training set is not very representative of the evaluation set.
We will return to that when taking a look at the final experiments.
Putting it all together

Now, having accounted for class imbalances in the dataset as well as having decided on embedding and transformation models, we are ready to test the models and see how much sentiment information the word embeddings have been able to pick up.
The set-up follows the standard machine learning approach: we trained the model using 10-fold cross-validation (CV) and evaluated the best model from each fold on the test set.
This gives us a measure of how much we can expect the performance of the model to vary when trained on (slightly) different datasets.
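The fold construction can be sketched as follows. This post does not say exactly how the folds were built, so splitting over topics (in line with the earlier 90%/10% topic split) is an assumption, as are the function name and the placeholder topic names.

```python
import random

def k_fold_splits(items, k=10, seed=0):
    """Yield (train, held_out) pairs; every item is held out exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

# Placeholder topic names, for illustration only.
topics = [f"topic_{i}" for i in range(25)]
splits = list(k_fold_splits(topics, k=10))
```

Each of the ten models is then trained on its own `train` part, selected on its `held_out` part, and finally scored on the separate test set, which gives ten test scores from which a mean and a standard deviation can be computed.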
It is always a good idea to include some baseline experiments.
These should be the simplest approaches you can imagine and if your advanced model cannot beat these, you know something is wrong.
We chose two simple baselines: 1) use the most frequent sentiment from the training set (which will be “positive”) as the prediction for any tweet, and 2) use random sentiments from the training set as predictions.
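Both baselines fit in a few lines. The sketch below is one possible reading of them: in particular, it assumes that "random sentiments from the training set" means sampling uniformly from the training labels, so that frequent sentiments are drawn more often. The function names and labels are made up.

```python
import random
from collections import Counter

def most_frequent_baseline(train_labels, n_predictions):
    """Always predict the most common sentiment in the training set."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_predictions

def random_baseline(train_labels, n_predictions, seed=0):
    """Predict sentiments drawn at random from the training labels."""
    rng = random.Random(seed)
    return [rng.choice(train_labels) for _ in range(n_predictions)]

def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Made-up labels, for illustration only.
train_labels = ["positive"] * 7 + ["negative"] * 2 + ["neutral"]
test_labels = ["negative", "negative", "positive", "negative"]

majority_preds = most_frequent_baseline(train_labels, len(test_labels))
random_preds = random_baseline(train_labels, len(test_labels))
majority_accuracy = accuracy(majority_preds, test_labels)
```

Note how the majority baseline suffers when the test distribution differs from the training distribution, as it does in this dataset.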
The results from training all eight models and the two baselines and evaluating on the unseen test set are illustrated in the figure below.
The vertical lines through the data points indicate one standard deviation across the 10 CV folds.
There are a number of interesting observations to be made here.
Firstly, there is a large improvement for any embedding model over the baselines.
The word embeddings therefore have, as expected, captured information that can be used to derive the sentiment of a tweet.
Secondly, turning to the NNLM embeddings, there doesn’t appear to be any improvement when using the nonlinear model compared to the affine model.
This is interesting because it suggests that the embedding space is simple enough that the affine model is able to use all the sentiment information available in the embeddings.
This is in contrast to the newer embeddings, ELMo and USE, where we do observe an improvement when using the nonlinear model, suggesting that the embedding spaces learnt by these models are more complex.
For NNLM, the normalised vectors do have a tendency to perform better than the unnormalised ones, but the effect is nowhere near significant in our experiments.
Lastly, while ELMo and USE both contain much more information than the NNLM embeddings, they perform quite similarly in these experiments.
USE seems to generally contain slightly more information than ELMo, but not significantly more.
This is, however, still interesting since the USE embedding space is of much lower dimensionality than the ELMo space and, consequently, the models are much faster to train.
The takeaway

In this project, we wanted to test how much information different word embeddings carry about the sentiment of tweets.
We did this by constructing two models for predicting the sentiment that would be as nonintrusive as possible, enabling us to see how much sentiment information the raw word embeddings contain.
The results show that both old and new word embeddings certainly do carry information about the sentiment and that the newer embeddings, unsurprisingly, contain more than the older ones.
The results also show that a nonlinear transformation of the topic vectors performs significantly better than an affine transformation for the newer embeddings, suggesting that these spaces are more complex than for the older embeddings.
In conclusion, word embeddings generally do contain a lot of information about the sentiment of tweets, with newer embeddings containing significantly more information.
While not overly surprising, it emphasizes the importance of advanced embedding models for predicting the sentiment of tweets.
Acknowledgements

The project was done as part of a SoBigData 2017 Short Term Scientific Mission (STSM) at the Department of Computer Science, University of Sheffield, in collaboration with Dr Diana Maynard.
A big shout-out to Dr Isabelle Augenstein for numerous discussions and advice during the entirety of this project.