Deep Learning is getting there.
Through the latest advances in sequence to sequence models, we can now develop good text summarization models.
Text summarization can be of two types:

1. Extractive Summarization: This approach selects passages from the source text and then arranges them to form a summary.
One way of thinking about this is like a highlighter underlining the important sections.
The main idea is that the summarized text is a subset of the source text.

2. Abstractive Summarization: In contrast, the abstractive approach involves understanding the intent and writing the summary in your own words.
I think of this as analogous to a pen.
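To make the extractive "highlighter" idea concrete, here is a minimal sketch in Python. It is not a method from this post or from the Pointer Generator paper: the function name `extractive_summary`, the word-frequency scoring rule and the regex-based sentence splitting are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Pick the highest-scoring sentences and return them in source order.

    Hypothetical helper: sentences are scored by the average frequency of
    their words in the whole text, a deliberately naive heuristic.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average rather than sum, so long sentences do not win on length alone.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore the original sentence order
    return " ".join(sentences[i] for i in chosen)


text = "The cat sat on the mat. The cat chased the mouse. Dogs bark."
summary = extractive_summary(text, num_sentences=1)
print(summary)
```

Whatever the scoring rule, the output is always made of sentences lifted verbatim from the source, which is exactly what separates the extractive approach from the abstractive one.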
Naturally abstractive summarization is the more challenging problem here.
This is one domain where machine learning has made slow progress.
It is a difficult problem because creating abstractive summaries requires a good command of both the subject and natural language, and each of these is hard for a machine.
Historically, we also lacked a large, good-quality data set for this problem.
Data set here means source texts paired with their abstractive summaries.
Since humans need to write the summaries, getting a lot of them is a problem except in one domain: news!

When presenting news articles, professional writers continuously summarize information, as shown in the snippet of CNN News below:

[Figure: News snippet]

Story highlights is a summary created for the bigger article.
News data from CNN and Daily Mail was collected to create the CNN/Daily Mail data set for text summarization, which is the key data set used for training abstractive summarization models.
Using this data set as a benchmark, researchers have been experimenting with deep learning model designs.
One such model that I love is the Pointer Generator Network by Abigail See.
I want to use this model to highlight the key components of a deep learning summarization model.
Before we get to the model, let's talk about the metric used to evaluate text summarization: the ROUGE score.
The ROUGE score measures the word overlap between the generated summary and a human-written reference summary.
ROUGE-1 measures single-word (unigram) overlap between the two, whereas ROUGE-2 measures bigram overlap.
Since the ROUGE score only looks at word overlap and not at the readability of the text, it is not a perfect metric: a text with a high ROUGE score can still be a badly written summary.
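As a rough illustration of how such an overlap score can be computed, here is a simplified sketch that compares a candidate summary against a single reference summary. It is not the official ROUGE toolkit: the function names `ngrams` and `rouge_n` and the lowercase whitespace tokenization are assumptions of this sketch, and real evaluations typically add stemming and support for multiple references.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N between two strings.

    Returns (recall, precision, f1) based on clipped n-gram overlap.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps, per n-gram, the smaller of the two counts.
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1


reference = "germany beat argentina in the world cup final"
candidate = "germany beat argentina in the final"

r1 = rouge_n(candidate, reference, n=1)
r2 = rouge_n(candidate, reference, n=2)
print("ROUGE-1 (recall, precision, f1):", r1)
print("ROUGE-2 (recall, precision, f1):", r2)
```

Note that shuffling the words of the candidate would leave the ROUGE-1 numbers unchanged, which is one concrete way to see why a high score does not guarantee a readable summary.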
Text Summarization through Deep Learning

The standard way of doing text summarization is to use a seq2seq model with attention.
See the model structure below, from the Pointer Generator blog.
[Figure: Encoder-Decoder model architecture]

There are three main aspects to a sequence to sequence model:

1. Encoder: A bi-directional LSTM layer that extracts information from the original text.
This is shown in red above.
The bi-directional LSTM reads one word at a time and, since it is an LSTM, it updates its hidden state based on the current word and the words it has read before.

2. Decoder: A uni-directional LSTM layer that generates summaries one word at a time.
The decoder LSTM starts working once it gets the signal that the full source text has been read.
It uses information from the encoder, as well as what it has written before, to create the probability distribution over the next word.
The decoder is shown in yellow above, with the probability distribution in green.

3. Attention Mechanism: The encoder and decoder are the building blocks here, but historically the encoder-decoder architecture on its own, without attention, wasn’t very successful.
Without attention, the input to the decoder is the final hidden state from the encoder, which can be a 256- or 512-dimensional vector. Such a small vector can’t possibly hold all the information in the source text, so it becomes an information bottleneck.
Through the attention mechanism, the decoder can access the intermediate hidden states in the encoder and use all that information to decide which word is next.
Attention is shown in blue above.
Attention is a pretty tricky concept so please don’t sweat if my brief description here was confusing.
You can read more about attention through my blog here.
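If it helps, the core of attention for a single decoder step can be sketched in a few lines of plain Python. This is a simplification, not the Pointer Generator implementation: the scores here are dot products between the decoder state and each encoder state (the paper uses a learned additive scoring function instead), and the names `softmax` and `attend` as well as the toy 2-dimensional states are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(encoder_states, decoder_state):
    """Return (attention weights, context vector) for one decoder step.

    Assumption: dot-product scoring, chosen here only for brevity.
    """
    scores = [sum(h_d * s_d for h_d, s_d in zip(h, decoder_state)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    # The context vector is the attention-weighted sum of all encoder states.
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]
    return weights, context


# Three source positions, each with a toy 2-dimensional hidden state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(encoder_states, decoder_state)
print("attention weights:", weights)
print("context vector:", context)
```

The weights form a probability distribution over source positions, so the decoder draws on every encoder hidden state rather than only the final one, which is how attention relieves the bottleneck described above.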
As the Pointer Generator paper shows, the above architecture is good enough to get started, but the summaries it creates have two problems:

1. The summaries sometimes reproduce factual details inaccurately (e.g. Germany beat Argentina 3–2).
This is especially common for rare or out-of-vocabulary words such as 2–0.

2. The summaries often repeat themselves (e.g. Germany beat Germany beat Germany beat…).

The pointer generator model addresses the first issue with a pointer mechanism that allows it to switch between generating text and copying it as is from the source. For repetition, the paper adds a separate coverage mechanism.
Think of the pointer as a generation probability, a scalar between 0 and 1.
If it is 1, the model generates a word abstractively from its vocabulary; if it is 0, it copies a word extractively from the source; values in between blend the two.
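The blend between generating and copying can be sketched as follows. The formula follows the Pointer Generator paper's final distribution, P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (total attention on the source positions holding w), but the function name `final_distribution`, the dictionary representation and all the numbers are made up for illustration.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix the generator's vocabulary distribution with the copy distribution.

    vocab_dist:    dict mapping each vocabulary word to its generation probability
    attention:     list of attention weights, one per source position
    source_tokens: list of source words aligned with `attention`
    """
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    for token, weight in zip(source_tokens, attention):
        # Source words missing from the vocabulary get an entry here,
        # which is how out-of-vocabulary words become producible.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy values (assumptions): a three-word vocabulary and a four-word source.
vocab_dist = {"germany": 0.5, "beat": 0.3, "won": 0.2}
source_tokens = ["germany", "beat", "argentina", "2-0"]  # last two are out of vocabulary
attention = [0.1, 0.1, 0.2, 0.6]

dist = final_distribution(0.25, vocab_dist, attention, source_tokens)
print(dist)
print("most likely next word:", max(dist, key=dist.get))
```

With a low generation probability and most of the attention on the score, the out-of-vocabulary token "2-0" ends up as the most likely next word, even though the generator alone could never produce it.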
Compared to the sequence-to-sequence-with-attention system, the pointer-generator network has several advantages:

1. The pointer-generator network makes it easy to copy words from the source text.

2. The pointer-generator model is even able to copy out-of-vocabulary words from the source text.
This is a major bonus, enabling us to handle unseen words while also allowing us to use a smaller vocabulary (which requires less computation and storage space).

3. The pointer-generator model is faster to train, requiring fewer training iterations to achieve the same performance as the sequence-to-sequence attention system.
Results

We have implemented this model in TensorFlow and trained it on the CNN/Daily Mail data set.
The model obtained a ROUGE-1 score of 0.38, which is state of the art.
Our observation is that the model does really well at creating summaries from news articles, which is the kind of data it was trained on.
However, when presented with text that is not news, it still creates good summaries, but those summaries are more extractive in nature.
I hope you liked the blog.
To test out the pointer generator model, please pull their code at link.
Alternatively contact me to check out our version of this model.
I have my own deep learning consultancy and love to work on interesting problems.
I have helped many startups deploy innovative AI based solutions.
Check us out at — http://deeplearninganalytics.
You can also see my other writings at: https://medium.
dwivedi

If you have a project that we can collaborate on, then please contact me through my website or at info@deeplearninganalytics.org

References:
Blog on Pointer Generator Model
Downloading CNN Daily Mail data set.