Let’s decode some of them.
DT: determiner.
NN: noun, singular.
NNP: proper noun, singular.
CC: coordinating conjunction.
JJR: adjective, comparative.
RB: adverb.
IN: preposition or subordinating conjunction.
How is pos_tag able to return all of these tags?
It uses a statistical model trained on annotated text.
Classic probabilistic approaches to this task include Conditional Random Fields (CRF) and Hidden Markov Models (HMM).
First, a CRF tagger extracts a set of features for each word, called state features.
These cover characteristics such as capitalization of the first letter, the presence of digits or hyphens, suffixes, and prefixes.
The model also considers the label of the previous word, through a function called a transition feature.
During training, it learns the weights of the different feature functions that maximize the likelihood of the correct labels.
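The feature functions described above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the example sentence, and the hand-set weights are all assumptions, and a real CRF learns its weights from labelled data rather than having them written by hand.

```python
def state_features(tokens, i):
    """Return per-word (state) features for the token at position i."""
    word = tokens[i]
    return {
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
        "prefix2": word[:2].lower(),
        "suffix3": word[-3:].lower(),
    }


def transition_feature(prev_label, label):
    """Name the feature that fires when `label` follows `prev_label`."""
    return f"{prev_label}->{label}"


def score(tokens, i, prev_label, label, weights):
    """Sum the weights of every feature that is active for this label."""
    total = weights.get(transition_feature(prev_label, label), 0.0)
    for name, value in state_features(tokens, i).items():
        total += weights.get(f"{label}:{name}={value}", 0.0)
    return total


# Hypothetical example sentence and hand-set weights (assumptions).
tokens = ["The", "European", "Union", "fined", "Google"]
weights = {
    "NNP:is_capitalized=True": 1.5,
    "DT->NNP": 0.5,
    "VBD:suffix3=ned": 2.0,
}

print(state_features(tokens, 1))
print(score(tokens, 1, "DT", "NNP", weights))
```

The tagger would compute such a score for every candidate label and prefer the label sequence with the highest total.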
The next step is to perform entity detection.
This task will be carried out using a technique called chunking.
Tokenization extracts only “tokens” or words.
On the other hand, chunking extracts phrases that may have an actual meaning in the text.
Chunking requires that our text is first tokenized and POS tagged.
It uses these tags as inputs.
It outputs “chunks” that can indicate entities.
An example of how chunking can be visualized.
NLTK has several functions that facilitate the process of chunking our text.
The mechanism is based on the use of regular expressions to generate the chunks.
We can first apply noun phrase chunks, or NP-chunks.
We’ll look for chunks matching individual noun phrases.
For this, we will customize the regular expressions used in the mechanism.
We first need to define rules.
They will indicate how sentences should be chunked.
Defining the rule of how sentences should be chunked.
Our rule states that our NP chunk should consist of an optional determiner (DT), followed by any number of adjectives (JJ), and then one or more proper nouns (NNP).
Now, we create a chunk parser using RegexpParser and this rule.
We’ll apply it to our POS-tagged words using chunkParser.
The result is a tree.
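As a minimal sketch of these steps, the rule can be written as a chunk grammar and passed to nltk.RegexpParser. The example sentence and its tags are written by hand here (an assumption, so that no tagger model has to be downloaded), and the variable names are illustrative.

```python
import nltk

# Hand-tagged (word, POS) pairs; in practice these come from the POS tagger.
tagged = [
    ("The", "DT"), ("European", "NNP"), ("Union", "NNP"),
    ("fined", "VBD"), ("Google", "NNP"),
]

# Optional determiner, any number of adjectives, one or more proper nouns.
grammar = "NP: {<DT>?<JJ>*<NNP>+}"
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tagged)

# Keep only the NP subtrees and join their words back into phrases.
chunks = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(chunks)
```

Printing the tree itself shows the full structure, including the words that were not chunked.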
In this case, we printed only the chunks.
We can also display it graphically.
Entities recognized in the text.
NLTK also provides a pre-trained classifier, available as a function of the nltk package.
It allows us to recognize named entities in a text.
Like chunking, it works on top of POS-tagged text.
Code showing how to use chunking for entity detection.
Entities recognized in the text.
As we can see, the results are the same using both methods.
However, the results are not completely satisfying.
Another disadvantage of NLTK is that its POS tagging supports only English and Russian.
2 SpaCy model: An open-source library in Python.
It provides an efficient statistical system for NER by labeling groups of contiguous tokens.
It is able to recognize a wide variety of named or numerical entities.
Among them, we can find company-names, locations, product-names, and organizations.
A huge advantage of Spacy is having pre-trained models in several languages: English, German, French, Spanish, Portuguese, Italian, Dutch, and Greek.
These models support tagging, parsing and entity recognition.
They have been designed and implemented from scratch specifically for spaCy.
They can be imported as Python libraries.
Importing spacy and pre-trained model in English.
They are also easy to load using spacy.
In our code, we save the loaded model in the variable nlp.
SpaCy provides a Tokenizer, a POS-tagger and a Named Entity Recognizer.
So it’s very easy to use.
We just call our model on our text: nlp(text).
This will tokenize it, tag it, and recognize the entities.
The attribute .sents retrieves the sentences, which we can iterate over to reach each token.
.tag_ gives the tag of each token.
.ents gives the recognized entities.
.label_ gives the label of each entity.
.text gives the plain text of any token, sentence, or entity.
We define a method for this task as follows.
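A minimal sketch of what such a method could look like, assuming spaCy v3. To keep it runnable without downloading a pre-trained model, it swaps in a blank English pipeline with a rule-based entity_ruler. The function name, patterns, and example sentence are assumptions, but the .ents, .text, and .label_ attributes are used in the same way with a pre-trained model.

```python
import spacy

# Assumption: a blank pipeline plus hand-written rules stands in for the
# pre-trained statistical model, so no model download is needed.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "European Union"},
    {"label": "ORG", "pattern": "Google"},
])


def extract_entities(text):
    """Return (entity text, entity label) pairs found by the pipeline."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


print(extract_entities("The European Union fined Google on Wednesday."))
```

With a pre-trained model loaded into nlp instead, the same function would return whatever entities the statistical model predicts.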
Now, we apply the defined method to our original Wikipedia text.
Spacy recognizes not only names but also numbers.
Very cool, right?
One question that probably arises is how SpaCy works.
Its architecture is very rich.
This results in a very efficient algorithm.
Explaining every component of the SpaCy model would require a whole separate post.
Even tokenization is done in a very novel way.
According to Explosion AI, SpaCy's Named Entity Recognition system features a sophisticated word embedding strategy using subword features, a deep convolutional neural network with residual connections, and a novel transition-based approach to named entity parsing.
Let’s explain these basic concepts step by step.
Word embedding strategy using subword features.
Wow! A very long name and a lot of difficult concepts.
What does this mean?
Instead of working with the words themselves, we represent them using multi-dimensional numerical vectors.
Each dimension captures a different characteristic of the words.
These vectors are referred to as word embeddings.
The advantage is that working with numbers is easier than working with words.
We can make calculations, apply functions, among other things.
The huge limitation is that these models normally ignore the morphological structure of the words.
To correct this, subword features are introduced to capture knowledge about the morphological structure of words.
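One common way to build subword features is to break each word into character n-grams, so that words sharing a stem also share features. The sketch below illustrates that general idea only. It is not SpaCy's actual implementation, and the function name and boundary markers are assumptions.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


print(char_ngrams("where", 3, 3))

# Morphologically related words share part of their n-grams.
shared = set(char_ngrams("running")) & set(char_ngrams("runner"))
print(sorted(shared))
```

An embedding built from these pieces can give related words similar vectors, even when one of them was never seen during training.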
Convolutional neural network with residual connections.
Convolutional networks are mainly used in image processing.
The convolutional layer slides a kernel or filter (a matrix of weights) over the input matrix, multiplying it element-wise by each window of the input and summing the result.
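The operation can be sketched in plain Python as follows. The function name and the small input matrix are assumptions chosen for illustration; real libraries implement the same idea far more efficiently.

```python
def convolve2d(matrix, kernel):
    """Slide the kernel over the matrix and sum the element-wise products."""
    rows, cols = len(matrix), len(matrix[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    for r in range(rows - k_rows + 1):
        out_row = []
        for c in range(cols - k_cols + 1):
            # Multiply the kernel by the window whose top-left corner is (r, c).
            total = sum(
                matrix[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            out_row.append(total)
        output.append(out_row)
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(convolve2d(image, kernel))  # [[-4, -4], [-4, -4]]
```

For text, the input matrix is a sequence of word vectors instead of pixels, so each window covers a few neighbouring words.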
In a traditional neural network, each layer feeds only the next layer.
A neural network with residual blocks splits a big network into small chunks.
These chunks of the network are connected through skip connections, also called shortcut connections.
A shortcut adds the input of a block directly to its output, so the signal can bypass the layers in between, which makes deep networks easier to train.
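A residual block can be sketched as a layer whose input is added back to its output. The tiny dense layer, the weights, and the function names below are assumptions for illustration only.

```python
def relu(vector):
    """Activation function: negative values become zero."""
    return [max(0.0, v) for v in vector]


def dense(vector, weights):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def residual_block(vector, weights):
    """Apply a layer, then add the untouched input back via the shortcut."""
    transformed = relu(dense(vector, weights))
    return [x + t for x, t in zip(vector, transformed)]


x = [1.0, -2.0]
w = [[0.5, 0.0],
     [0.0, 0.5]]
print(residual_block(x, w))  # [1.5, -2.0]
```

Even where the layer outputs zero, as for the second value here, the original input still reaches the next block through the shortcut.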
Transition-based approach to named entity parsing.
This strategy processes the sentence in sequential steps, each of which adds a label or changes the state, until it reaches the most likely sequence of tags.
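The idea of building entities one step at a time can be sketched as a small state machine. The action names follow the BILUO tagging scheme (Begin, In, Last, Unit, Out). The list of actions is written by hand here, which is an assumption: a real parser predicts each action with its neural network, and the function name is illustrative.

```python
def apply_actions(tokens, actions):
    """Turn a sequence of transition actions into (phrase, label) entities.

    "O" skips a token, "U-X" marks a one-token entity of type X,
    "B-X" opens an entity, "I-X" extends it, and "L-X" closes it.
    """
    entities, current = [], []
    for token, action in zip(tokens, actions):
        move, _, label = action.partition("-")
        if move == "U":
            entities.append((token, label))
        elif move == "B":
            current = [token]
        elif move == "I":
            current.append(token)
        elif move == "L":
            current.append(token)
            entities.append((" ".join(current), label))
            current = []
    return entities


tokens = ["The", "European", "Union", "fined", "Google"]
actions = ["O", "B-ORG", "L-ORG", "O", "U-ORG"]
print(apply_actions(tokens, actions))
```

Each action changes the state (the open entity) or emits a finished entity, which mirrors the sequential steps described above.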
Lastly, Spacy provides a function to display a beautiful visualization of the Named Entity annotated sentences: displacy.
Let’s use it!
Wrapping up!
Named Entity Recognition is a field that is still developing.
A lot has been done regarding this topic.
However, there is still room for improvement.
Natural Language Toolkit (NLTK): It is a very powerful tool.
It provides many algorithms to choose from for the same task.
However, it only supports 2 languages.
And it requires more tuning.
It does not support word vectors.
SpaCy: It is a very advanced tool.
It supports 7 languages as well as a multi-language model.
It is more efficient.
However, the algorithms behind it are complex.
It keeps only the best algorithm for each task.
It has support for word vectors.
Anyhow, just go ahead and try an approach!
You will have fun!