Natural Language Processing: Event Extraction
Extracting events from news articles
Rodrigo Nader
May 2
The amount of text generated every day is mind-blowing.
Millions of data feeds are published in the form of news articles, blogs, messages, manuscripts and countless more, and the ability to automatically organize and handle them is becoming indispensable.
With improvements in neural network algorithms, significant increases in computing power and easy access to comprehensive frameworks, Natural Language Processing has never been so widely explored.
One of its common applications is Event Extraction: the process of gathering knowledge about periodical incidents found in texts, automatically identifying what happened and when it happened.
For example:
2018/10: President Donald Trump’s government banned countries from importing Iranian oil, with exemptions for seven countries.
2019/04: US Secretary of State Mike Pompeo announced that his country would grant no more exemptions after the deadline.
2019/05: The United States ended the exemptions that allowed countries to import oil from Iran without suffering US sanctions.
This ability to contextualize information allows us to connect events distributed in time, understand their effects, and see how a set of episodes unfolds through time.
Those are valuable insights that drive organizations like EventRegistry and Primer.AI, which provide the technology to different market sectors.
In this article, we’re going to build a simple Event Extraction script that takes in news feeds and outputs the events.
Get the data
The first step is gathering the data.
This could be any type of text as long as it can be represented in a timeline.
I chose to use newsapi, since it’s an easy-to-use source of news and the developer plan is free up to 500 requests a day.
Following are the functions built to handle the requests.
This last function returns a list of approximately 2,000 articles given a specific query.
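As a rough illustration (not the original code), request helpers of this kind could look like the sketch below. It assumes NewsAPI's `/v2/everything` endpoint with its `q`, `from`, `to`, `language`, `pageSize` and `apiKey` parameters. The function names and the `fetch` argument are hypothetical; `fetch` exists only so the logic can be tried without network access.

```python
import json
from datetime import date, timedelta
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://newsapi.org/v2/everything"


def build_url(query, day, api_key, page_size=100):
    """Build the request URL for one query and one calendar day."""
    params = {
        "q": query,
        "from": f"{day.isoformat()}T00:00:00",
        "to": f"{day.isoformat()}T23:59:59",
        "language": "en",
        "pageSize": page_size,
        "apiKey": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def http_get_json(url):
    """Default fetcher: perform the HTTP request and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def fetch_titles(query, start, n_days, api_key, fetch=http_get_json):
    """Collect date/title/description rows, issuing one request per day."""
    rows = []
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        payload = fetch(build_url(query, day, api_key))
        for article in payload.get("articles", []):
            rows.append({
                "date": article["publishedAt"][:10],  # keep YYYY-MM-DD only
                "title": article["title"],
                "description": article.get("description"),
            })
    return rows
```

A call such as `fetch_titles("Paris", date(2019, 4, 1), 30, "YOUR_API_KEY")` would issue one request per day. Pagination, rate limiting and error handling are deliberately left out of this sketch.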
Our purpose is to extract those articles’ events, so in order to simplify the process, I’m keeping only their titles (in theory, titles should already comprise the core message behind the news).
That leaves us with a data frame like the one above, including dates, descriptions, and titles.
Give meaning to sentences
Now that we have our titles ready, we need to represent them in a way that our algorithms understand.
Notice that I’m skipping a whole stage of pre-processing here, simply because that isn’t the purpose of this article.
But if you are starting with NLP, make sure to include those basic pre-processing steps before applying the models → here is a nice tutorial.
To give meaning to independent words and, consequently, whole sentences, we’ll use SpaCy’s pre-trained word embedding models.
More specifically, we’ll use SpaCy’s large model (en_core_web_lg), which has pre-trained word vectors for 685k English words.
Alternatively, you could use any pre-trained word representation model (Word2Vec, FastText, GloVe…).
By default, SpaCy computes a sentence’s vector as the average of its word vectors.
It’s a simplistic approach that doesn’t take into account the order of words to determine a sentence’s vector.
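The averaging idea can be illustrated with a toy example. The three-dimensional vectors and the `toy_vectors` table below are invented for illustration (the real model's vectors have 300 dimensions); the point is only that an average ignores word order.

```python
# Hypothetical 3-dimensional word vectors, invented for this illustration.
toy_vectors = {
    "fire": [1.0, 0.0, 2.0],
    "hits": [0.0, 2.0, 0.0],
    "paris": [2.0, 1.0, 1.0],
}


def sentence_vector(sentence, vectors):
    """Average the vectors of the known words in a sentence."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        raise ValueError("no known words in sentence")
    # zip(*...) regroups the word vectors dimension by dimension.
    columns = zip(*(vectors[w] for w in words))
    return [sum(column) / len(words) for column in columns]


a = sentence_vector("Fire hits Paris", toy_vectors)
b = sentence_vector("Paris hits fire", toy_vectors)
print(a)       # [1.0, 1.0, 1.0]
print(a == b)  # True: word order does not change the average
```

With SpaCy itself, `nlp(title).vector` returns this kind of average once a model with word vectors is loaded.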
For a more sophisticated strategy, we could take a look at models like Sent2Vec and SkipThoughts.
This article about unsupervised summarization gives an excellent introduction to SkipThoughts.
For now, let’s stick with SpaCy’s method.
So each title will be represented by a 300-dimensional array.
Cluster those vectors
Even though we are filtering our articles by a search term, many topics can arise for the same query.
For example, searching for “Paris” could result in:
Paris comes together after a devastating fire
Or:
Brazil football legend Pele admitted to hospital in Paris
To group articles from different topics, we’ll use a clustering algorithm.
In this particular case, I wanted to try the DBSCAN algorithm, because it doesn’t require us to specify the number of clusters in advance.
Instead, it determines by itself how many clusters to create and their sizes.
The epsilon parameter (eps) sets the maximum distance between two samples for them to be considered part of the same neighborhood. If eps is too big, fewer clusters are formed; if it’s too small, most points are classified as not belonging to any cluster (label -1), which also results in few clusters.
Here is a chart showing the number of clusters by epsilon:
Tuning the eps value might be one of the most delicate steps, because the outcome will vary a lot depending on how strictly you want to treat sentences as similar.
The right value will come up with experimentation, trying to find a value that preserves the similarities between sentences without splitting close sentences into different groups.
In general, since we want to end up with very similar sentences in the same cluster, the target should be a value that returns a higher number of classes.
For that reason, I chose a number between 0.08 and 0. Check out the Scikit Learn documentation to find out more about eps and other parameters.
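To see how eps changes the outcome, here is a small sketch using scikit-learn's `DBSCAN`. The data are invented 2-D unit vectors rather than real title embeddings. The eps values, `min_samples=2` and the cosine metric are assumptions chosen for this toy data, not the settings behind the results in this article.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy "sentence vectors": 2-D unit vectors at chosen angles (in degrees).
# Two tight groups plus one isolated point; all values are invented.
angles = np.deg2rad([0, 2, 4, 90, 92, 94, 200])
X = np.column_stack([np.cos(angles), np.sin(angles)])


def cluster_labels(X, eps, min_samples=2):
    """Run DBSCAN with cosine distance and return one label per row."""
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
    return model.fit_predict(X)


def n_clusters(labels):
    """Count clusters, ignoring the noise label -1."""
    return len({int(label) for label in labels} - {-1})


for eps in (0.0001, 0.01, 1.5):
    labels = cluster_labels(X, eps)
    print(eps, n_clusters(labels), [int(label) for label in labels])
```

On this toy data, the smallest eps leaves every point unclustered, the middle value recovers the two groups and marks the isolated point as -1, and the largest value merges everything into a single cluster. This mirrors the trade-off described above.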
Now we can check the size of each cluster:
The -1 class stands for sentences with no cluster, while the others are cluster indexes.
If we analyze the biggest clusters, we find that those should represent the most important topics (or at least the most commented ones).
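Counting cluster sizes can be sketched with the standard library. The `labels` list below is a made-up stand-in for the clustering output, with one label per title.

```python
from collections import Counter

# Hypothetical clustering output: one label per title, -1 meaning "no cluster".
labels = [0, 0, -1, 1, 0, 2, -1, 1, 0, -1]

sizes = Counter(labels)
noise = sizes.pop(-1, 0)  # set the unclustered titles aside

print("unclustered titles:", noise)
for label, size in sizes.most_common():  # biggest clusters first
    print(f"cluster {label}: {size} titles")
```

If the labels live in a pandas data frame column, `value_counts()` on that column gives the same tally.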
Let’s check one of the clusters:
Transform to Events
We end up with a data frame like the one above for each cluster.
The next step is to arrange those sentences in time and filter them by relevance.
I chose to display one article per day so that the timeline is clean and consistent.
Since there are many titles about the same topic every day, we need a criterion to pick one among them.
It should be the sentence that best represents the event, one that comprises the core message which those titles refer to.
In order to achieve that, we can group the daily sentences, and for each group (or cluster), choose the one closest to the cluster center.
Here are the functions to find the central vector given a list of sentences.
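A minimal sketch of this selection step is shown below; it is not the original code. It assumes Euclidean distance to the mean vector (cosine distance would be an equally reasonable choice), and the function names and the 2-D vectors are hypothetical stand-ins for real title embeddings.

```python
from collections import defaultdict

import numpy as np


def central_index(vectors):
    """Return the index of the vector closest to the mean of all vectors."""
    vectors = np.asarray(vectors, dtype=float)
    center = vectors.mean(axis=0)
    distances = np.linalg.norm(vectors - center, axis=1)
    return int(np.argmin(distances))


def one_title_per_day(rows):
    """rows: iterable of (date, title, vector); keep the most central title per day."""
    by_day = defaultdict(list)
    for day, title, vector in rows:
        by_day[day].append((title, vector))
    timeline = []
    for day in sorted(by_day):  # ISO dates sort chronologically as strings
        titles, vectors = zip(*by_day[day])
        timeline.append((day, titles[central_index(vectors)]))
    return timeline


# Invented 2-D vectors standing in for 300-dimensional title embeddings.
rows = [
    ("2019-05-02", "title A", [0.0, 0.0]),
    ("2019-05-02", "title B", [1.0, 1.0]),
    ("2019-05-02", "title C", [4.0, 1.0]),
    ("2019-05-01", "title D", [3.0, 3.0]),
]
print(one_title_per_day(rows))
```

For 2019-05-02 the mean lies nearest to "title B", so that title is kept for the day; a day with a single title simply keeps it.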
Neat and tidy.
Finally, using Plotly, we can plot a handy timeline chart:
That’s it. From approximately 2,000 articles, we made a script to extract and organize events.
Now you can imagine how useful it may be to apply this to millions of articles every day.
Just take stock markets and the impact of daily news as an example and you’ll realize the value of Event Extraction.
Many steps could be added to improve the results, like properly pre-processing the data, including POS tagging and NER, applying better sentence-to-vector models, and so on.
But starting here, a desirable result can be reached very quickly.
Thank you for reading this post.
This was an article focused on NLP and Event Extraction.
If you want more about Data Science and Machine Learning, make sure to follow my profile and please feel free to leave any ideas, comments or concerns.