Text Summarization with Python
Umer Farooq, Mar 28

This is a simple guide to understanding the text summarization problem, with a Python implementation.

Table of contents
Motivation
Why is text summarization important?
What is Summarization?
Extractive Method
Abstractive Method
Text Summarization using Python
Further Reading

Motivation
To take appropriate action, we need the latest information. At the same time, the amount of information keeps growing. It spans many categories (economy, sports, health, technology…) and comes from many sources (news sites, blogs, SNS…). An automatic and accurate summarization feature helps us understand a topic and shortens the time needed to do so.

Why is text summarization important?
Propelled by modern technological innovation, data is to this century what oil was to the last one. Today, our world is shaped by the gathering and dissemination of huge amounts of data. In fact, the International Data Corporation (IDC) projects that the total amount of digital data circulating annually around the world will grow from 4.4 zettabytes in 2013 to 180 zettabytes in 2025. That’s a lot of data!

With so much data circulating in the digital space, we need algorithms that can automatically shorten long texts and deliver accurate summaries that fluently convey the intended message.

What is Summarization?
We can regard summarization as a function whose input is a document and whose output is a summary. The types of its input and output let us categorize summarization tasks.

1. Single document summarization: summary = summarize(document)
2. Multi-document summarization: summary = summarize(document_1, document_2, …)

We can also pass a query to give the summary a viewpoint.

Query focused summarization: summary = summarize(document, query)

This type is called “query focused summarization”, in contrast to “generic summarization”. In particular, a type whose viewpoint is the “difference” (update) from earlier material is called “update summarization”.

Update summarization: summary = summarize(document, previous_document_or_summary)

The summary itself also comes in several varieties.

Indicative summary
It reads like the summary of a book. It describes what kind of story it is but does not tell the whole story, especially the ending, so an indicative summary carries only partial information.

Informative summary
In contrast to the indicative summary, the informative summary includes the full information of the document.

Keyword summary
Not running text, but the words or phrases taken from the input document.

Headline summary
A summary in a single line.

Approach
There are two main ways to produce a summary: extractive and abstractive.

Extractive Method
Select relevant phrases from the input document and concatenate them to form a summary (like “copy-and-paste”).

Pros: Extractive methods are quite robust, since they use existing natural-language phrases taken straight from the input.
Cons: They lack flexibility, since they cannot use novel words or connectors. They also cannot paraphrase the way people sometimes do.

The main categories of extractive summarization follow.

Graph Base
The graph-based model builds a graph from the document, then summarizes it by considering the relations between the nodes (text units). TextRank is the typical graph-based method.

TextRank
TextRank is based on the PageRank algorithm used by the Google search engine. Its basic concept is that a page with incoming links is good, and even better if those links come from pages that are themselves linked from many pages. The links between pages are expressed as a matrix (like a round-robin table). We can convert this matrix into a transition probability matrix by dividing each entry by the total number of links on its page. A random surfer then moves from page to page according to this matrix.
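
The random-surfer idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the TextRank reference implementation: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the three-page example are all hypothetical choices made for this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank the nodes of a link matrix with power iteration.

    links[i][j] is 1 if node i links to node j, else 0.
    """
    n = len(links)
    # Build the transition probability matrix by dividing each row
    # by the number of outgoing links of that node.
    transition = []
    for row in links:
        total = sum(row)
        if total == 0:
            # Assumption: a node without outgoing links jumps anywhere uniformly.
            transition.append([1.0 / n] * n)
        else:
            transition.append([value / total for value in row])

    ranks = [1.0 / n] * n
    for _ in range(iterations):
        new_ranks = []
        for j in range(n):
            # Probability mass flowing into node j from every node i.
            incoming = sum(ranks[i] * transition[i][j] for i in range(n))
            new_ranks.append((1.0 - damping) / n + damping * incoming)
        ranks = new_ranks
    return ranks


# Hypothetical 3-page web: pages 0 and 1 link to page 2, page 2 links to page 0.
links = [
    [0, 0, 1],
    [0, 0, 1],
    [1, 0, 0],
]
ranks = pagerank(links)
```

In TextRank the nodes would be sentences (or words) rather than pages, and the link matrix would typically hold similarity weights between text units instead of 0/1 links.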

Feature Base
The feature-based model extracts features from each sentence, then evaluates its importance. A representative piece of research is:

Sentence Extraction Based Single Document Summarization

The following features are used in that method:

Position of the sentence in the input document
Presence of a verb in the sentence
Length of the sentence
Term frequency
Named entity tag (NE)
Font style
…etc.

All the features are accumulated into a score. The number of coreferences is the number of pronouns that refer to the previous sentence. It is calculated simply by counting the pronouns that occur in the first half of the sentence, so this score represents how strongly the sentence refers to the previous one.

With the scores, we can evaluate each sentence. The next step is to select sentences while avoiding duplicated information. In this paper, the words shared between a new sentence and the already selected sentences are considered. Finally, a refinement step connects the selected sentences.
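
As a rough sketch of this score-then-select flow, the code below accumulates three of the listed features (position, length, term frequency) and then picks sentences greedily while skipping those that mostly repeat already selected words. It is not the method from the paper: the equal feature weights, the length cap of 20 words, the `max_overlap` threshold, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def score_sentences(sentences):
    """Accumulate simple features (position, length, term frequency) per sentence."""
    term_freq = Counter(word for s in sentences for word in tokenize(s))
    scores = []
    for index, sentence in enumerate(sentences):
        words = tokenize(sentence)
        if not words:
            scores.append(0.0)
            continue
        position = 1.0 / (index + 1)          # earlier sentences score higher
        length = min(len(words), 20) / 20.0   # very short sentences score lower
        frequency = sum(term_freq[w] for w in words) / len(words)
        scores.append(position + length + frequency)
    return scores


def select_sentences(sentences, scores, limit=2, max_overlap=0.5):
    """Pick top-scoring sentences, skipping ones that repeat already chosen words."""
    chosen = []
    chosen_words = set()
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    for i in ranked:
        words = set(tokenize(sentences[i]))
        if not words:
            continue
        overlap = len(words & chosen_words) / len(words)
        if chosen and overlap > max_overlap:
            continue  # too similar to what is already in the summary
        chosen.append(i)
        chosen_words |= words
        if len(chosen) == limit:
            break
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]


sentences = [
    "The cat sat on the mat.",
    "The cat sat on the mat today.",
    "Dogs bark loudly at night.",
]
summary = select_sentences(sentences, score_sentences(sentences))
```

In this toy input the second sentence scores well but shares most of its words with the first, so the selection step passes over it in favor of the third.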

Luhn’s Algorithm is also feature-based. It evaluates the “significance” of each word, calculated from its frequency. You can try feature-based text summarization with TextTeaser (PyTeaser is available for Python users).
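
To make the idea of word “significance” concrete, here is a simplified sketch in the spirit of Luhn’s method: words that occur often enough (and are not stopwords) count as significant, and a sentence is scored by its densest cluster of significant words, using the square of the number of significant words divided by the cluster span. The tiny stopword list, the `min_count` threshold, the `max_gap` of 4, and the function names are assumptions for this example rather than values taken from the article.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not a complete one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "on", "that", "for"}


def significant_words(text, min_count=2):
    """Treat non-stopwords that occur at least min_count times as significant."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    return {w for w, c in counts.items() if c >= min_count}


def luhn_score(sentence, significant, max_gap=4):
    """Score a sentence by its best cluster of significant words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = 0  # index into positions where the current cluster begins
    for k in range(1, len(positions) + 1):
        # Close the cluster at the end of the sentence or when the gap
        # of insignificant words between two significant words is too wide.
        if k == len(positions) or positions[k] - positions[k - 1] - 1 > max_gap:
            count = k - start
            span = positions[k - 1] - positions[start] + 1
            best = max(best, count * count / span)
            start = k
    return best


text = "Summarization shortens text. Good summarization keeps the key text. Weather is nice."
significant = significant_words(text)
first = luhn_score("Summarization shortens text.", significant)
second = luhn_score("Good summarization keeps the key text.", significant)
third = luhn_score("Weather is nice.", significant)
```

Sentences with the highest scores would then be extracted as the summary.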

Topic Base
The topic-based model calculates the topics of the document and evaluates each sentence by which topics it contains (the “main” topic gets a high weight when scoring the sentence). Latent Semantic Analysis (LSA) is usually used to detect the topics. It is based on SVD (Singular Value Decomposition). The following paper is a good starting point for an overview of LSA (topic) based summarization:

Text summarization using Latent Semantic Analysis

The simple LSA-based sentence selection
There are many variations in how to score and select sentences from the SVD values. The simplest method is to select sentences by topic (= V, the eigenvectors/principal axes) and its score. If you want to use LSA, gensim supports it.
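
The simplest selection rule can be sketched without any library. The code below builds a term-by-sentence count matrix A and approximates the first right singular vector (the first row of V^T in A = U S V^T) with power iteration on A^T A; the sentence with the largest weight in that vector is taken as the best representative of the main topic. This is a dependency-free illustration, not how gensim does it: the function names, the raw-count weighting, and the iteration count are assumptions, and a real system would normally call a proper SVD routine.

```python
import re
from collections import Counter


def term_sentence_matrix(sentences):
    """Rows are terms, columns are sentences, values are raw term counts."""
    counts = [Counter(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    vocabulary = sorted(set().union(*counts))
    return [[c[term] for c in counts] for term in vocabulary]


def main_topic_weights(matrix, iterations=100):
    """Approximate the first right singular vector of the matrix.

    Power iteration on A^T A converges to the sentence weights of the
    strongest topic. Assumes at least one sentence contains a word.
    """
    n = len(matrix[0])
    # gram[i][j] is the dot product of sentence columns i and j (A^T A).
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n)] for i in range(n)]
    vector = [1.0] * n
    for _ in range(iterations):
        vector = [sum(gram[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in vector) ** 0.5
        vector = [v / norm for v in vector]
    return vector


sentences = [
    "Cats chase mice.",
    "Cats chase mice and birds.",
    "Stocks fell sharply.",
]
weights = main_topic_weights(term_sentence_matrix(sentences))
best_sentence = sentences[weights.index(max(weights))]
```

Here the two sentences about cats share a topic, so they receive almost all of the weight, while the unrelated third sentence receives almost none.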

Grammar Base
The grammar-based model parses the text and constructs a grammatical structure, then selects or reorders substructures.

Title Generation with Quasi-Synchronous Grammar

This model can produce a meaningful “paraphrase” based on the grammatical structure.

Abstractive Method
Generate a summary that keeps the original intent, just as humans do.

Pros: Abstractive methods can use words that were not in the original input, which allows more fluent and natural summaries.
Cons: It is a much harder problem, since the model must now generate coherent phrases and connectors.

Extractive and abstractive methods do not conflict; you can use both to generate a summary. There is also a way to collaborate with a human.

Aided Summarization
Aided summarization combines automatic methods with human input. The computer suggests important information from the document, and the human decides whether to use it. It draws on information retrieval and text mining techniques.

Encoder-Decoder Model
As its name suggests, the encoder-decoder model is composed of an encoder and a decoder. The encoder converts an input document into a latent representation (a vector), and the decoder generates a summary from it. The encoder-decoder model is a neural network model that is now mainly used in machine translation, and it is also widely used for abstractive summarization. If you want to try an encoder-decoder summarization model, TensorFlow offers a basic one.

Combination Approach
Rather than using only extractive or only abstractive methods, combine them to generate summaries.

Pointer-Generator Network
This approach combines the extractive and abstractive models through a switching probability.

Summarization with Pointer-Generator Networks

This is a hybrid network that can choose to copy words from the source via pointing while retaining the ability to generate words from the fixed vocabulary.
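
The switching probability can be illustrated with a small sketch of how the final word distribution is mixed: a generation probability p_gen weights the fixed-vocabulary distribution, and (1 - p_gen) weights the attention over source positions, so that a source word outside the vocabulary can still be produced by copying. The function name `final_distribution` and every number below are made up for illustration; in the real network these quantities are computed by the model at each decoding step.

```python
def final_distribution(p_gen, vocab_dist, attention, source_words):
    """Mix generation and copy probabilities into one word distribution.

    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on positions holding w
    """
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    for weight, word in zip(attention, source_words):
        final[word] = final.get(word, 0.0) + (1.0 - p_gen) * weight
    return final


# Hypothetical values for a single decoding step.
vocab_dist = {"the": 0.5, "team": 0.3, "won": 0.2}        # fixed vocabulary
source_words = ["argentina", "won", "the", "match"]      # input document tokens
attention = [0.6, 0.2, 0.1, 0.1]                          # attention over the source
dist = final_distribution(0.5, vocab_dist, attention, source_words)
```

In this example “argentina” is not in the fixed vocabulary, yet it ends up with a non-zero probability because the attention points at it.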
An implementation of the pointer-generator network is publicly available.

Deep Reinforced Model
This is a neural network model with a novel intra-attention that attends over the input and the continuously generated output separately, and a new training method that combines standard supervised word prediction with reinforcement learning (RL).

Text Summarization using Python

Gensim
gensim.summarization offers TextRank summarization (this module is available in gensim 3.x and was removed in gensim 4.0):

    from gensim.summarization.summarizer import summarize

    # text is the input document as a single string
    print(summarize(text))

gensim.models.lsimodel offers a topic model:

    from gensim.test.utils import common_dictionary, common_corpus
    from gensim.models import LsiModel

    # Train an LSI model on gensim's small built-in test corpus
    model = LsiModel(common_corpus, id2word=common_dictionary)
    # Project the bag-of-words corpus into the latent topic space
    vectorized_corpus = model[common_corpus]

TextTeaser
TextTeaser is an automatic summarization algorithm that combines natural language processing and machine learning to produce good results.

    >>> from textteaser import TextTeaser
    >>> tt = TextTeaser()
    >>> tt.summarize(title, text)

PyTeaser
PyTeaser takes any news article and extracts a brief summary from it. Summaries are created by ranking the sentences in a news article according to how relevant they are to the entire text. The top 5 sentences are used to form a “summary”.
Each sentence is ranked using four criteria:

Relevance to the title
Relevance to keywords in the article
The position of the sentence
Length of the sentence

    >>> from pyteaser import SummarizeUrl
    >>> url = 'http://www.html'
    >>> summaries = SummarizeUrl(url)
    >>> print(summaries)

Textrank
pytextrank is a Python implementation of TextRank.
The complete pytextrank implementation is available in a notebook.

TensorFlow summarization
The core model is the traditional sequence-to-sequence model with attention. It is customized (mostly in its inputs and outputs) for the text summarization task. The model was trained on the Gigaword dataset and achieved state-of-the-art results (as of June 2016).

Further Reading
Automatic summarization
Recurrent Neural Networks for Better Summarization
Text summarization with TensorFlow
Has Deep Learning been applied to automatic text summarization (successfully)?
Automatic Text Summarization, 2014.
Innovative Document Summarization Techniques: Revolutionizing Knowledge Understanding, 2014.
Taming Recurrent Neural Networks for Better Summarization, 2017.
Deep Learning for Text Summarization
A Gentle Introduction to Text Summarization
A Quick Introduction to Text Summarization