For machines it’s even harder: because language is so flexible, the sequence of words you are looking for might not appear word-for-word in the passage. With reading comprehension being so difficult, there’s no single approach machines can take to solve the problem.
So, now what do we do?

Let’s add a little… MACHINE LEARNING!

Why don’t we employ the power of machine learning to help us solve this problem? Machine learning has emerged as an extremely powerful technique for reading text and extracting important concepts from it; it’s been the obsession of most computational linguists for the past few years. So let’s make this obsession a good one and put it to use on our problem!

First, a brief detour: we’ll be using the CNN/Daily Mail dataset in this project.
Take a look at an example document/query:

( @entity1 ) it’s the kind of thing you see in movies, like @entity6’s role in “@entity7” or @entity9’s “@entity8.” but, in real life, it’s hard to swallow the idea of a single person being stranded at sea for days, weeks, if not months and somehow living to talk about it. miracles do happen, though, and not just in @entity17…

Query: an @entity156 man says he drifted from @entity103 to @placeholder over a year @entity113

Each document and query has already undergone entity recognition and tokenization. The goal is to guess which entity should be substituted into “@placeholder” in order for the query to make sense.
Our goal now is to formulate this as a machine learning problem, so that we can train a model and use it to predict the correct entity. In this spirit, we can frame the problem as binary classification: given a new document and query pair, we transform it into a set of “document-query” examples, where positive examples correspond to correctly guessing that an entity fits the blank, and negative examples correspond to correctly guessing that an entity should not fill the blank.
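As a sketch, the transformation of one document-query pair into labeled binary examples might look like this (the helper name and the toy document, query, and entity list are hypothetical, not from our actual pipeline):

```python
def make_binary_examples(document, query, entities, answer):
    """For each candidate entity, substitute it into the query's
    @placeholder slot; label 1 if it is the true answer, else 0."""
    examples = []
    for entity in entities:
        filled = query.replace("@placeholder", entity)
        label = 1 if entity == answer else 0
        examples.append((document, filled, label))
    return examples

pairs = make_binary_examples(
    "@entity1 visited @entity2 yesterday",
    "@placeholder went on a trip",
    ["@entity1", "@entity2"],
    "@entity1",
)
# One positive example (@entity1) and one negative example (@entity2)
```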
For every document-query pair, we also create a set of features to associate with the pair, since feeding the entire raw pair into a machine learning model at this point is infeasible. We then employ a logistic regression model on these features to solve the classification problem.
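To make the classifier concrete, here is a minimal logistic-regression sketch trained with plain gradient descent in NumPy. The two hand-built features (imagine, say, an entity’s frequency in the document and its distance from a query-word match) and all numbers are made up for illustration, not taken from our actual feature set:

```python
import numpy as np

# Toy feature matrix: one row per document-query example, two
# hypothetical features per row; label 1 = entity fills the blank.
X = np.array([[3.0, 1.0], [0.0, 5.0], [2.5, 1.5], [0.5, 4.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

w = np.zeros(2)
b = 0.0
for _ in range(2000):                          # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))     # sigmoid probabilities
    w -= 0.5 * (X.T @ (p - y)) / len(y)        # gradient of log loss
    b -= 0.5 * np.mean(p - y)

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
# The learned linear boundary separates the two classes on this toy data
```

In practice a library implementation (e.g. scikit-learn’s `LogisticRegression`) would replace this hand-rolled loop, but the learned object is the same: a linear function of the features passed through a sigmoid.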
After training the model, we achieve an accuracy of 29%, meaning that 29% of the documents had the blank filled in correctly.
For context, most documents in the dataset contain about 25 entities, so randomly guessing a word for each document would have an accuracy of roughly 4%.
So this model performs pretty decently!

The logistic regression model performs okay, but if we’re being honest, 29% accuracy isn’t exactly “human-like”. So, how can we make our model learn more effectively?

Here is where deep learning comes into play.
When humans read text, they don’t just learn a few heuristics about the text and then make guesses based off of those heuristics.
Rather, they learn to understand the underlying meaning of the text and to make meaningful inferences based off of their understanding.
That is our goal with this problem as well! Deep learning will provide us with the tools we need to truly teach machines to read.

With a new approach comes new goals. From now on, we won’t limit ourselves to binary classification; instead, we view the problem more holistically: our model will be allowed to choose any word in the document as the correct entity to “fill in the blank”. This is more representative of actual learning than our previous formulation.
Behold… deep learning

Using these machine learning techniques is great and all, but can we do better? Logistic regression is an effective machine learning model and a quick way to get a baseline accuracy, but it falls short in several respects. The way that logistic regression decides whether a word should fill in a blank is too rigid; namely, logistic regression can only learn linear functions, which aren’t suitable for a wide range of problems.
This is where we can now turn to deep learning and the power of neural networks for our problem.
Neural networks are a recent hot development in machine learning that allow us to learn more complex functions than normal models like logistic regression can.
In this article, we’ll consider a special kind of neural network, called Long Short-Term Memory, or LSTM for short.
Here’s what the LSTM looks like:

Seems complicated, but if we break it down piece by piece, we can understand what this network is doing.
Imagine reading over a sentence: word by word, your mind is thinking about the sentence as you read and is formulating thoughts little by little.
The same goes for an LSTM; it will take in each word of a sentence and generate a hidden state after seeing a word.
You can think of this hidden state as the thought that the LSTM transmits to the next time step when it comes time to read the next word.
In the context of our problem, we can feed the passage followed by the question into our LSTM, and finally guess which word would best fit the blank in the query based on the final output of the LSTM.
(If you’re wondering how we obtain the final word: we take the output of the LSTM and compute a probability for every candidate word that could possibly fill the blank, then choose the word with the highest probability.)

What is special about the LSTM (versus other networks with a similar structure) is that an LSTM can “remember” information about words over long ranges in a sentence and can “forget” information quickly when necessary.
This allows an LSTM to determine what is important when looking at a certain word and what it needs to remember from previous words.
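The prediction step described earlier, turning the LSTM’s final output into a choice of entity, can be sketched as a softmax over candidate scores. The scores and entity names below are hypothetical stand-ins for what the trained network would produce:

```python
import numpy as np

# Hypothetical final scores from the LSTM, one per candidate entity.
scores = np.array([1.2, 3.4, 0.5])
candidates = ["@entity1", "@entity5", "@entity9"]

exp = np.exp(scores - scores.max())   # numerically stable softmax
probs = exp / exp.sum()               # probabilities sum to 1
answer = candidates[int(np.argmax(probs))]
```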
You may now ask: “how do we feed words into a network?” We could feed the raw strings into the network, but it’s hard for neural networks to parse raw strings of data. Instead, we represent each word using embeddings: each word is mapped to a fixed-length vector, making it easy for the LSTM to run computations on the words.
Ideally, we want words that have to do with each other to be “closer” to each other with respect to their embeddings.
Fortunately, the great minds at Stanford NLP have already done this task for us; they have a downloadable set of embeddings called GloVe (Global Vectors for Word Representation) that have proved to be very effective in natural language processing tasks.
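To see what “closer” means for embeddings, here is a toy cosine-similarity check. The 4-dimensional vectors below are made up purely for illustration (real GloVe vectors have 50 to 300 dimensions and are learned from corpus statistics):

```python
import numpy as np

# Made-up embeddings: two related words and one unrelated word.
emb = {
    "king":  np.array([0.90, 0.80, 0.10, 0.00]),
    "queen": np.array([0.85, 0.75, 0.20, 0.10]),
    "apple": np.array([0.00, 0.10, 0.90, 0.80]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_related = cosine(emb["king"], emb["queen"])
sim_unrelated = cosine(emb["king"], emb["apple"])
# Related words score higher, i.e. they are "closer" in embedding space
```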
We use these in our models and achieve a stunning increase in accuracy: 39%! This improvement over the non-deep baseline shows the power of deep learning to model this task, as well as the power of the LSTM.
The logistic regression and BiLSTM loss curves. Note how the BiLSTM loss drops; this is a sign of fantastic learning!

Can we do better?

But now we again ask: can we do better? The answer is a strong yes! At a high level, what we want to do is make our model think more like a human does.
As humans, when we perform a reading comprehension task, we don’t just read the text and then guess what should go in the blank; rather, we like to look at the query for clues as to which words in the document are more relevant and should be considered more closely.
Similarly, for our model we introduce the concept of attention.
The goal of attention is to generate a matrix whose values represent the relative “attention” the model should give to each document word.
Why Attention?

The intuition for attention comes from the way humans think.
For example, when we perform a reading comprehension task, we use the query to guide our reading of the text.
We focus in on particular portions of the text that are more relevant for the query and ignore portions of the text that are irrelevant.
We want to do the same thing with our model.
We would like the machine to understand which portions of the text are relevant to the query and focus in on those portions.
The architecture described below attempts to do just that.
Our Architecture

An example model architecture using attention (Cui et al.). GRUs are basically just fancy LSTMs.
There are many ways to implement attention; we chose a variant of the “Sum Attention” model described above. As in the diagram above, we begin by feeding our document and query into separate GRU modules (GRUs are LSTMs with a few more bells and whistles).
The output of each GRU is then processed as follows (this is the attention part of the model!).
We first calculate what is known as “query to document attention”.
What this means is that the model tries to understand which of the document words are the most important given one word of the query.
But this returns a different attention distribution for each query word, so the question becomes: if each query word places different importance on different document words, how do we decide which query words to actually listen to? To answer this, our model asks the reverse question: given a particular document word, how important is each query word? This gives us a measure of the importance of each query word. We then average the importance of each query word to get a final importance. This final importance is combined with the previously computed document-word attentions to produce a final weighted average of attention for each document word.
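The computation described above can be sketched in NumPy: a column-wise softmax gives query-to-document attention, a row-wise softmax gives document-to-query attention, averaging the latter gives a per-query-word importance, and the weighted combination yields the final attention over document words. The match matrix below is hypothetical (in the real model it would come from the document and query GRU outputs):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical pairwise match scores between 4 document words (rows)
# and 3 query words (columns).
M = np.array([[2.0, 0.1, 0.3],
              [0.2, 1.5, 0.4],
              [0.1, 0.2, 2.2],
              [0.0, 0.1, 0.2]])

alpha = softmax(M, axis=0)        # query-to-document attention (per query word)
beta = softmax(M, axis=1)         # document-to-query attention (per doc word)
q_importance = beta.mean(axis=0)  # average importance of each query word
s = alpha @ q_importance          # final attention over document words
```

Because each column of `alpha` and the vector `q_importance` are both probability distributions, the final vector `s` is itself a distribution over document words.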
The original “Attention over Attention” architecture.
We want to improve the process by which the document and query are initially combined.
Finally, we implemented our own innovation to try to improve this model further.
This model assumes that if a document word and a query word place high importance on each other, their representations will be very similar.
However, we attempt to improve on this by trying to let the model learn the importance relationship between the document and the query.
Just as humans learn how much weight to place on their understanding of the text itself versus on the text’s connection to the query, we want to allow the machine to learn this relationship.
Our hope is that this gives the machine a more nuanced view of the text and the question being asked.
At the time of writing we have not yet managed to get good results for our current architecture.
We don’t believe that this is due to a flaw in our model, but rather a result of the difficulties in training complex deep learning models in general.
Among the most important lessons we take from this project is that building these models is extremely difficult and testing and bug-fixing can be very time-consuming.
We hope to refine this model further in the future and achieve a better accuracy than the base model.
Lastly, the most important lesson we have taken from this project is the effectiveness of deep learning to teach machines to solve complex problems.
The original attention-over-attention architecture that we modeled ours on had a final test accuracy of over 70%. That’s a massive improvement over the non-deep baseline of 29%! Deep learning shines in problems like this, and we hope to continue and bolster future work on this task to advance the field of machine comprehension.
Our paper, which goes over more of the technical details, can be found at this link.
Many thanks go out to Jeffrey Cheng, David Rolnick, and Konrad Kording for their Spring 2019 offering of CIS 700 (Deep Learning) at the University of Pennsylvania, and their constant teaching and mentorship throughout this project.