A couple tricks for using spaCy at scale

The Python package spaCy is a great tool for natural language processing.
Here are a couple things I’ve done to use it on large datasets.
Schaun Wheeler · Apr 13

Me processing text on a Spark cluster (artist’s rendition).
When a project I’m working on requires natural language processing, I tend to turn to spaCy first.
Python has several other NLP packages, each with its own strengths and weaknesses.
I generally find spaCy to be fast and its conventions to be simple to learn.
One difficulty I’ve encountered with spaCy is the need to process large numbers of small texts.
For example, I recently had several hundred thousand different labels, each from one to 15 words long, that I decided I wanted to compare using word2vec.
I’ve been happy with spaCy’s word vectorization, but I first needed to process each string to remove irrelevant words and parts of speech, lemmatize some tokens, and things like that.
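As an illustration, that kind of preprocessing might look something like the sketch below. This is my own example, not code from the post: the `clean` helper and the sample sentence are hypothetical, and I use a blank pipeline so the sketch runs without a downloaded model (the real work used a full model so word vectors were available too).

```python
import spacy

# A blank English pipeline: tokenizer plus language data only.
nlp = spacy.blank("en")

def clean(doc):
    # Drop stop words and punctuation; keep lemmas of everything else.
    # lemma_ falls back to the raw text when no lemmatizer is loaded.
    return [tok.lemma_ or tok.text
            for tok in doc
            if not tok.is_stop and not tok.is_punct]

tokens = clean(nlp("the cats are running fast"))
```

Run per record, a step like this is cheap; the expense comes from doing it hundreds of thousands of times on top of full pipeline calls.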
It took a long time.
Here are two ways I sped things up.
Process everything as one document, then split into spans

The intuitive way to process a bunch of separate records is to process them separately.
I started out by looping through each record, and calling spaCy’s nlp on each one.
I didn’t time how long it took to process everything, because I ran out of patience and just killed the process before it finished.
So I thought about it a little, read some more of the documentation, and came up with a different approach: join all the texts into a single string, with each original text separated by some character sequence that is very unlikely to occur naturally.
I chose three pipes with a space on each side.
I then called nlp once.
It took a few seconds to process around 25,000 records.
After that, I could run through the list of tokens, identify the special character sequences, and use them to split the document into spans.
These spans contain the individual tokens, and have the attributes of documents such as word vectors.
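The approach described above might look something like this. It is a sketch rather than the post’s original code: the sample texts are made up, and I use a blank pipeline so it runs anywhere, where the post used the large English model (`en_core_web_lg`) to get word vectors.

```python
import spacy

nlp = spacy.blank("en")

# Hypothetical labels standing in for the real dataset.
texts = ["quarterly revenue report", "customer churn analysis", "ad spend"]

# Three pipes with a space on each side: unlikely to occur naturally,
# and it tokenizes as a single token.
delimiter = " ||| "
doc = nlp(delimiter.join(texts))  # one nlp() call for everything

# Walk the tokens once, cutting a span at every delimiter token.
spans = []
start = 0
for token in doc:
    if token.text == "|||":
        spans.append(doc[start:token.i])
        start = token.i + 1
spans.append(doc[start:])
```

With a vector-bearing model loaded instead of the blank pipeline, each resulting span exposes a `.vector`, just like a full document.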
Use Spark

This isn’t as simple as it sounds.
In my experience, the most expensive part of doing NLP is loading the corpus.
Once you have everything loaded, it’s just a matter of efficiently looking up the information you need.
If you’re using the largest English corpus spaCy offers, as I was, you face some difficulty getting that corpus onto your Spark executors.
For one thing, spaCy uses a different pickling library than PySpark does, which results in errors when you try to include the corpus in a user-defined function, whether as a broadcast variable or otherwise.
At any rate, it doesn’t necessarily make sense to serialize the huge corpus, move it over to the executors, and deserialize it there, when you can just load it on each executor directly.
That presents a new problem.
It takes a long time to load the corpus, so you want to minimize how many times you need to load it.
I’ve found it useful to follow a practice I’ve used in the past when deploying scikit-learn models to PySpark: group records into reasonably long lists, then call a UDF on each list.
That allows you to do the expensive thing once, at the cost of having to do a relatively small group of inexpensive things serially.
Here’s what I mean: I randomly grouped my Spark dataframe into 20 sections, collecting all of the texts in each section into a list.
My UDF loads the corpus from spaCy, then runs through the texts to process them.
Two things to note. First, I process each document separately in a loop. There’s no reason I need to do that: I could use the first trick of processing all the documents as a whole and then splitting into spans. The two tricks aren’t mutually exclusive.
Second, for my purposes I only needed word vectors, so that’s all I returned. I don’t know whether PySpark would have a hard time moving entire spaCy document or span objects back from the executors; I’ve never tried.
At any rate, I generally find it’s a good practice when using Spark to only get what you need, because moving stuff around is expensive.
So if you only need word vectors, write a UDF that just returns those.
If you just need named entities or parts of speech or whatever, write a UDF that just returns those.
If you need all of those things, separate them out in the UDF and return them each as a separate field.
Using these two methods, processing large amounts of text has become much more efficient for me.