Language Detection Benchmark using Production Data

This is a benchmark of multilingual language detection algorithms on real-life social media data.
Samuel Jefroykin · Jul 1

The Tower of Babel by Pieter Bruegel the Elder (1563)

As data scientists, we’re accustomed to processing many different types of data.
But when it comes to text-based data, knowing the language of the data is a top priority.
I experienced this challenge first hand when developing four language-based algorithms for English, Spanish, French, and Serbian.
Generally, the first step in any multilingual NLP pipeline is detecting the language of the text.
However, in production, the data is generally imbalanced across the languages being processed, which means it’s imperative to identify the language of each data point before sending it to the corresponding algorithm.
For any product dealing with a language imbalance in its data, not “losing” data is essential to the product’s success. To ensure that products don’t lose data, we have to make sure data isn’t routed to the wrong algorithm as a result of a language detection failure.
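In a pipeline like this, the language detection step acts as a router in front of the per-language models. A minimal sketch of that idea (the handler names, toy detector, and confidence threshold below are hypothetical illustrations, not the original pipeline):

```python
# Hypothetical router: send each post to the model for its detected
# language, and hold back low-confidence detections instead of
# routing them wrongly (which is how data gets "lost").
def route(post, detect, handlers, min_confidence=0.9):
    lang, confidence = detect(post)
    if confidence < min_confidence or lang not in handlers:
        return ("review_queue", post)  # don't lose it, don't misroute it
    return (lang, handlers[lang](post))

# Toy detector and handlers for illustration only.
def toy_detect(text):
    return ("en", 0.99) if "the" in text.lower() else ("sr", 0.40)

handlers = {"en": lambda t: f"en-model({t})"}

print(route("the city is great", toy_detect, handlers))  # routed to the English model
print(route("grad je sjajan", toy_detect, handlers))     # low confidence -> review queue
```

The key design point is the fallback branch: a misrouted post is effectively lost to the product, while a held-back post can still be recovered.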
Understanding the data distribution is crucial to finding a good Key Performance Indicator (KPI) for any research project, and especially for this benchmarking project.
Sampling Data for the Test Set:

For this benchmark, I wanted the test set to represent, as closely as possible, what I normally have in production.
I therefore selected 100K posts from Twitter that had a language parameter: 95,780 English tweets, 1,093 Spanish, 2,500 French, and 627 Serbian.
This distribution represented approximately two weeks of data in production.
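The sampling step can be sketched roughly as follows (the `lang` field follows Twitter’s API convention; the function itself is an illustration, not the original script):

```python
from collections import defaultdict
import random

def sample_test_set(posts, quotas, seed=42):
    """Draw a per-language sample that mirrors the production
    distribution, using the tweet's own `lang` parameter."""
    by_lang = defaultdict(list)
    for post in posts:
        by_lang[post["lang"]].append(post)
    rng = random.Random(seed)
    sample = []
    for lang, n in quotas.items():
        sample.extend(rng.sample(by_lang[lang], n))
    return sample

# Quotas matching the article's test set (~2 weeks of production data).
quotas = {"en": 95_780, "es": 1_093, "fr": 2_500, "sr": 627}
```

Fixing the random seed keeps the drawn test set reproducible across benchmark runs.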
Moreover, in order to have the best test set possible, annotators manually reviewed and corrected potential mistakes in the Twitter language parameters. This means that I compared the tested models against this ground truth.
I could also have eliminated the Twitter LD bias entirely by labeling the data from scratch.
Testing Algorithms:

For this Language Detection (LD) benchmark, I compared Polyglot LD, Azure Text Analytics LD, Google Cloud LD, and fastText LD.
In order to reproduce the following code, you need to pip install all the imported packages. In addition, you will need to download the fastText model, create Azure API credentials, and obtain a Google Cloud API JSON credentials file.
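Since the original code listing did not survive here, below is a hedged sketch of the kind of harness involved: every detector is wrapped behind a common `detect(text) -> language code` interface (e.g. polyglot’s `Detector(text).language.code` or fastText’s `model.predict`), and each model is scored on the same test set. A toy stand-in detector is used so the sketch runs without any API credentials:

```python
from collections import defaultdict

def score(detect, test_set):
    """Score one LD model: global accuracy plus per-language recall.
    `test_set` is a list of (text, true_lang) pairs; `detect` maps
    text -> predicted language code (or None on failure)."""
    correct = 0
    per_lang_hits = defaultdict(int)
    per_lang_total = defaultdict(int)
    for text, true_lang in test_set:
        pred = detect(text)
        per_lang_total[true_lang] += 1
        if pred == true_lang:
            correct += 1
            per_lang_hits[true_lang] += 1
    accuracy = correct / len(test_set)
    recall = {lang: per_lang_hits[lang] / per_lang_total[lang]
              for lang in per_lang_total}
    return accuracy, recall

# Toy stand-in detector; in the real benchmark this slot is filled by
# a wrapper around polyglot, fastText, Azure Text Analytics, or Google
# Cloud LD.
def toy_detect(text):
    return "en" if text.isascii() else "fr"
```

Reporting recall per language, rather than only a global figure, is what surfaces the behavior on the small Spanish, French, and Serbian slices.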
Results Analysis:

The results of my benchmarking project showed that different models serve different metrics.
Accuracy: If accuracy is the most important metric, all of the models perform at practically the same level (Figure 1).
Figure 1: Global accuracy per model

However, in the case of my project, accurate language detection was not the only KPI that mattered.
In addition, it was essential to prevent data loss because if data is lost, the product will fail.
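With a distribution this skewed, accuracy alone says very little: a degenerate detector that labels everything English would already score about 95.8% accuracy on this test set while losing every Spanish, French, and Serbian post. A quick back-of-the-envelope check:

```python
# Test-set counts from the sampling step.
counts = {"en": 95_780, "es": 1_093, "fr": 2_500, "sr": 627}
total = sum(counts.values())  # 100,000 posts

# "Always English" baseline: accuracy looks fine...
baseline_accuracy = counts["en"] / total
print(f"accuracy: {baseline_accuracy:.4f}")  # prints: accuracy: 0.9578

# ...but recall on every minority language is zero, i.e. all of that
# data is lost to the downstream models.
minority_lost = total - counts["en"]
print(f"posts misrouted: {minority_lost}")  # prints: posts misrouted: 4220
```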
Recall: The recall graph (Figure 2) shows that even though every model is very precise, Google’s model outperforms the rest in overall recall.
Figure 2: Recall by model for all languages

Conclusion:

Although the first instinct is to look only at accuracy, which would suggest that Azure is the best performer for language detection, other models may be preferable once other metrics come into play.
In this case, Google Cloud LD outperformed Azure (and all the other models tested) in terms of recall, especially given how imbalanced the data was: one large data set (English) alongside significantly smaller data sets in the other languages (Spanish, French, and Serbian).
In the case of my particular project, where recall was the leading metric, Google Cloud LD was ultimately my LD model choice.
I would like to thank my fellow co-workers from Zencity, who played an integral part in this project: Inbal Naveh Safir, Ori Cohen, and Yoav Talmi.
Samuel Jefroykin is a Data Scientist at Zencity.io, where he is trying to positively influence the quality of life in cities.
He also co-founded Data For Good Israel, a community dedicated to using the power of data for social issues.