A Comparison Between Spacy NER & Stanford NER Using All US City Names
Ori Cohen · Apr 8
Sugarloaf Mountain, Rio De Janeiro, Brazil, using Kodak Ektar film.
Recently I had to do a POC with named entities; the immediate options were, obviously, NLTK, Stanford NER, and spaCy.
I remember reading several comparisons that gave the edge to spaCy and others that gave it to Stanford.
Seeing that there wasn’t a clear-cut case, I decided to do my own test.
At Zencity.io, our domain is govtech, and our use-case for this POC was to identify as many US city names as possible.
The idea is simple: load all the US city names and see which algorithm identifies more of them as a 'location'.
Obviously, we can identify names in other categories such as persons or organizations, but that wasn’t the point.
Keep in mind that these classifiers are meant to be used on full sentences rather than on standalone entities; still, this is a good way to test, in a clean and controlled manner, how many are identified.
An official project with all the relevant files can be found in the following GitHub repository.
US city names
I found a CSV file with US city names; it contains 63,211 cities, towns, etc.
Several US cities, taken out of the CSV file.
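Loading the aliases can be sketched as follows; the column name "name" and the two-row sample are my assumptions, not the real file's schema:

```python
import csv
import io

def load_city_names(csv_text):
    """Read every city alias out of the CSV's 'name' column (assumed header)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["name"] for row in reader]

# Tiny stand-in for the real 63,211-row file.
sample = "name,state\nSpringfield,IL\nPortland,OR\n"
print(load_city_names(sample))  # ['Springfield', 'Portland']
```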
spaCy
The next step was to load spaCy and check whether it recognized each city alias as a geo-political entity (GPE).
A GPE is spaCy's label for a known country, city, or state.
The following code shows exactly how to do this.
Other named entities can be found in the documentation.
Stanford's NER
The next step is to use NLTK's implementation of Stanford's NER (SNER).
We start by loading the relevant libraries and pointing to the Stanford NER code and model files; we then systematically check whether the tagger recognizes each city name as a 'location'.
For this test, I decided to use the basic 3-class model that includes location, person and organization.
You can use the following code for this purpose.
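A sketch of that check via NLTK's wrapper; the jar and model paths are assumptions about where you unpacked the Stanford NER download (the 3-class model file name is the standard one shipped with it):

```python
from nltk.tag.stanford import StanfordNERTagger

# Assumed local paths to the Stanford NER jar and the 3-class
# (location/person/organization) model.
JAR = "stanford-ner.jar"
MODEL = "english.all.3class.distsim.crf.ser.gz"

def is_location(city_name, tagger=None):
    """True if Stanford's NER tags every token of the name as LOCATION."""
    tagger = tagger or StanfordNERTagger(MODEL, JAR, encoding="utf-8")
    tagged = tagger.tag(city_name.split())
    return all(label == "LOCATION" for _, label in tagged)
```

Note that each call to tag() spawns the Java tagger afresh, which is exactly the slowness described below.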
A comparison of Spacy & Stanford's NER
Conclusion
It's quite surprising that the majority of reviews claim only a small advantage in favor of either one.
In our domain the advantage is clearly large; we can identify nearly twice as many entities.
(*) However, it is well known that NLTK's implementation of Stanford's NER is slow.
According to my teammate Samuel Jefroykin, this can be attributed to the authors' decision to reload the NER server on every prediction.
It took 24 hours to run the NLTK script, so please be advised.
Obviously, we could load the original Java server ourselves, but that is unnecessary overhead for a quick POC.
spaCy is fun and fast to use, and if you don't mind the big gap in performance, I would recommend using it over NLTK's implementation of Stanford's NER.
I would like to thank Samuel Jefroykin, Yoav Talmi, and Natanel Davidovits for proofreading and comments.
Ori Cohen has a PhD in computer science with a focus on machine learning. He leads the data-science team at Zencity.io, trying to positively influence citizens' lives.