Building a Search Engine with BERT and TensorFlow

In this experiment, we will use a pre-trained BERT model checkpoint to build a general-purpose text feature extractor.
Denis Antyukhov, Jun 28

Figure: T-SNE decomposition of BERT text representations (Reuters-21578 benchmark, 6 classes)

Such feature extractors are sometimes referred to as Natural Language Understanding (NLU) modules, because the features they extract are relevant for a wide array of downstream NLP tasks.
One use of these features is in instance-based learning, which relies on computing the similarity of the query to the training samples.
We will illustrate this by building a simple nearest neighbour search engine, using the BERT NLU module for feature extraction.
The plan for this experiment will be:

1. getting the pre-trained BERT model checkpoint
2. extracting a sub-graph optimized for inference
3. creating a feature extractor with tf.Estimator
4. exploring vector space with T-SNE and Embedding Projector
5. implementing a nearest neighbour search engine
6. accelerating nearest neighbour queries with math
7. example: building a movie recommendation system

Questions and Answers

What is in this guide?
This guide contains implementations of two things: a BERT text feature extractor and a nearest neighbour search engine.

For whom is this guide?
This guide should be useful for researchers interested in using BERT for natural language understanding tasks.
It may also serve as a worked example of interfacing with tf.Estimator.

What does it take?
It should take around 30 minutes to complete this guide.

Show me the code.
The code for this experiment is available in Colab here.
Also, check out the repository I set up for my BERT experiments: it contains bonus stuff.
Now, let’s start.

Step 1: getting the pre-trained model

We start with a pre-trained BERT checkpoint.
For demonstration purposes, I will be using the English model pre-trained by Google engineers.
To configure and optimize the graph for inference, we will make use of the awesome bert-as-service repository.
This repository allows for serving BERT models for remote clients over TCP.
Having a single BERT-server is definitely beneficial in many environments.
However, in this part of the experiment we will focus on creating a local (in-process) feature extractor.
This is useful if one wishes to avoid additional latency and potential failure modes introduced by a client-server architecture.
Now, let us download the model and install the package.
zip
!pip install bert-serving-server --no-deps

Step 2: optimizing the inference graph

Normally, to modify the model graph we would have to do some low-level TensorFlow programming.
However, thanks to bert-as-service, we can configure the inference graph using a simple CLI.
There are a couple of parameters there to look out for.
For each text sample, BERT encoding layers output a tensor of shape [sequence_len, encoder_dim], with one vector per token.
If we are to obtain a fixed representation, we need to apply some sort of pooling.
The POOL_STRAT parameter defines the pooling strategy applied to the encoding layer number POOL_LAYER.
The default value ‘REDUCE_MEAN’ averages the vectors for all tokens in a sequence.
This strategy works best for most sentence-level tasks, when the model is not fine-tuned.
Another option is NONE, in which case no pooling is applied at all.
This is useful for word-level tasks such as Named Entity Recognition or POS tagging.
For a detailed discussion of the effects of these options, check out Han Xiao’s blog post.
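To make the two pooling options concrete, here is a minimal sketch in plain Python (not the bert-as-service implementation). The function names, the toy numbers, and the choice to skip padding positions when averaging are assumptions made for illustration.

```python
def reduce_mean_pool(token_vectors, mask):
    """Average the vectors of real (non-padding) tokens into one fixed-size vector."""
    dim = len(token_vectors[0])
    n_real = sum(mask)
    pooled = [0.0] * dim
    for vec, keep in zip(token_vectors, mask):
        if keep:  # assumption: padding positions do not contribute to the mean
            for j in range(dim):
                pooled[j] += vec[j]
    return [x / n_real for x in pooled]


def no_pool(token_vectors):
    """No pooling: keep one vector per token, as a word-level task would need."""
    return [list(vec) for vec in token_vectors]


# Toy encoder output with sequence_len = 4 and encoder_dim = 3 (made-up values).
# The last row stands for a padding position.
tokens = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0], [0.0, 0.0, 0.0]]
mask = [1, 1, 1, 0]

sentence_vector = reduce_mean_pool(tokens, mask)  # shape [encoder_dim]
token_matrix = no_pool(tokens)                    # shape [sequence_len, encoder_dim]
```

With mean pooling every text maps to a vector of the same size regardless of its length, which is what the nearest neighbour search later in this guide relies on.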
SEQ_LEN affects the maximum length of sequences processed by the model.
Smaller values will increase the model inference speed almost linearly.
Running the above command will put the model graph and weights into a GraphDef object which will be serialized to a pbtxt file at GRAPH_OUT.
The file will be 3 times smaller than the pre-trained model because the nodes and variables required for training will be removed.
This results in a very portable solution: for example, the English model only takes 380 MB after exporting.

Step 3: creating a feature extractor

Now, we will use the serialized graph to build a feature extractor using the tf.Estimator API.
We will need to define two things: input_fn and model_fn.

input_fn manages getting the data into the model.
That includes executing the whole text preprocessing pipeline and preparing a feed_dict for BERT.
First, each text sample is converted into a tf.Example instance containing the necessary features listed in INPUT_NAMES.
The bert_tokenizer object contains the WordPiece vocabulary and performs the text preprocessing.
After that, the examples are regrouped by feature name in a feed_dict.
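The preprocessing described above can be sketched in plain Python. This is an illustration under stated assumptions, not the guide's input_fn: the three names in INPUT_NAMES are assumed to be the usual BERT input features, the vocabulary is a toy one, and whitespace splitting stands in for the WordPiece tokenization that bert_tokenizer performs.

```python
# Assumption: the usual three BERT input features.
INPUT_NAMES = ["input_ids", "input_mask", "input_type_ids"]


def build_features(text, vocab, seq_len):
    """Turn one text into fixed-length input features (toy tokenizer)."""
    # Reserve two positions for the special [CLS] and [SEP] tokens.
    tokens = ["[CLS]"] + text.lower().split()[: seq_len - 2] + ["[SEP]"]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    pad = seq_len - len(ids)
    return {
        "input_ids": ids + [vocab["[PAD]"]] * pad,
        "input_mask": [1] * len(ids) + [0] * pad,  # 1 = real token, 0 = padding
        "input_type_ids": [0] * seq_len,           # single-segment input
    }


def build_feed_dict(texts, vocab, seq_len):
    """Regroup the per-example features by feature name."""
    examples = [build_features(t, vocab, seq_len) for t in texts]
    return {name: [ex[name] for ex in examples] for name in INPUT_NAMES}


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4, "world": 5}
feed = build_feed_dict(["Hello world", "hello there big world"], toy_vocab, seq_len=6)
```

Each value in `feed` is a list with one fixed-length row per text, so the whole batch can be handed to the model in one go.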
Estimators have a fun feature which makes them re-build and re-initialize the whole computational graph at each call to the predict function.
So, to avoid this overhead, we will pass the predict function a generator that yields the features to the model in a never-ending loop.
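The never-ending generator trick can be illustrated without TensorFlow. In the sketch below, `fake_predict` is a hypothetical stand-in for the Estimator's predict call: it does its expensive setup once and then lazily maps inputs to outputs. `DataContainer`, `endless_feature_generator` and `build_vectorizer_sketch` are also illustrative names, not part of any library.

```python
class DataContainer:
    """Holds the batch that the generator should serve next."""

    def __init__(self):
        self._texts = None

    def set(self, texts):
        self._texts = [texts] if isinstance(texts, str) else list(texts)

    def get(self):
        return self._texts


def endless_feature_generator(container, featurize):
    """Yield features for whatever batch the container currently holds, forever."""
    while True:
        yield featurize(container.get())


setup_calls = []  # counts how often the "graph" gets built


def fake_predict(feature_stream):
    """Hypothetical stand-in for predict: one costly setup, then lazy outputs."""
    setup_calls.append(1)
    for features in feature_stream:
        yield [len(text) for text in features]  # dummy "embedding": text length


def build_vectorizer_sketch(predict, featurize=lambda texts: texts):
    container = DataContainer()
    # predict is called exactly once; the generator keeps it alive between queries.
    predictions = predict(endless_feature_generator(container, featurize))

    def vectorize(texts):
        container.set(texts)
        return next(predictions)

    return vectorize


vectorize = build_vectorizer_sketch(fake_predict)
first = vectorize(["sample text"])
second = vectorize(["a", "bb"])
```

Both calls are served by the same prediction stream, so the setup step runs only once however many batches are vectorized.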
model_fn contains the specification of the model.
In our case, it is loaded from the pbtxt file we saved in the previous step.
The features are mapped explicitly to the corresponding input nodes via input_map.
Now we have almost everything we need to perform inference.
Let’s do this!

A standalone version of the feature extractor described above can be found in the repository.

>>> bert_vectorizer = build_vectorizer(estimator, build_input_fn)
>>> bert_vectorizer(64*['sample text']).shape
(64, 768)

Step 4: exploring vector space with Projector

Now it’s time for a demonstration!
Using the vectorizer we will generate embeddings for articles from the Reuters-21578 benchmark corpus.
To visualize and explore the embedding vector space in 3D we will use a dimensionality reduction technique called T-SNE.
Let’s get the article embeddings first.
The interactive visualization of generated embeddings is available on the Embedding Projector.
From the link you can run T-SNE yourself, or load a checkpoint using the bookmark in the lower-right corner (loading works only on Chrome).
be/4XQAwhW6TLA

Step 5: building a search engine

Now, let’s say we have a knowledge base of 50k text samples, and we need to answer queries based on this data, fast.
How do we retrieve the sample most similar to a query from a text database?
The answer is nearest neighbour search.
Formally, the search problem we will be solving is defined as follows: given a set of points S in vector space M, and a query point Q ∈ M, find the closest point in S to Q.
There are multiple ways to define ‘closest’ in vector space; we will use Euclidean distance.
So, to build an Information Retrieval system for text, which is essentially a Search Engine, we will follow these steps:

1. Vectorize all samples from the knowledge base (that gives S)
2. Vectorize the query (that gives Q)
3. Compute Euclidean distances D between Q and S
4. Sort D in ascending order, providing indices of the most similar samples
5. Retrieve labels for said samples from the knowledge base

To make the simple matter of implementing this a bit more exciting, we will do it in pure TensorFlow.
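Before the TensorFlow version, the steps above can be sketched in a few lines of plain Python. The helper names and the toy two-dimensional vectors are made up for illustration; in the actual pipeline S and Q would come from the BERT vectorizer.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_neighbours(query, samples, labels, k=1):
    """Return the k (label, distance) pairs closest to the query."""
    distances = [euclidean(query, s) for s in samples]          # step 3
    order = sorted(range(len(samples)), key=lambda i: distances[i])  # step 4
    return [(labels[i], distances[i]) for i in order[:k]]       # step 5


# Toy knowledge base S with labels, and a query Q (made-up values).
S = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
labels = ["origin", "far", "near"]
Q = [0.9, 1.2]

top2 = nearest_neighbours(Q, S, labels, k=2)
```

Here `top2` lists the sample labelled "near" first and "origin" second, since those are the two points closest to Q.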
First we create the placeholders for Q and S

Define Euclidean distance computation

Finally, get the most similar indices

Step 6: accelerating search with math

Now that we have a basic retrieval engine set up, the question is: can we make it run faster?
With a tiny bit of math, we can.
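For a pair of vectors p and q the Euclidean distance is defined as follows:

$$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$$

This is exactly how we computed it in Step 5.
However, since p and q are vectors, we can expand and rewrite this:

$$d(p, q)^2 = \langle p - q, p - q \rangle = \langle p, p \rangle - 2 \langle p, q \rangle + \langle q, q \rangle$$

where ⟨…⟩ denotes the inner product.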
In TensorFlow this can be written as follows:

Because the matrix multiplication op is highly optimized, this implementation works slightly faster than the previous one.
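As a sanity check on the algebra, the plain-Python sketch below computes the distances both ways and lets us confirm they agree. It is not the TensorFlow code from this guide; the function names and toy vectors are illustrative, and the inner products here are ordinary loops rather than an optimized matrix multiplication.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def distances_direct(Q, S):
    """Distance matrix from the definition: sqrt of summed squared differences."""
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s))) for s in S] for q in Q]


def distances_expanded(Q, S):
    """Distance matrix from the expansion <q,q> - 2<q,s> + <s,s>."""
    qq = [dot(q, q) for q in Q]
    ss = [dot(s, s) for s in S]
    return [
        # max(..., 0.0) guards against tiny negative values caused by rounding.
        [math.sqrt(max(qq[i] - 2.0 * dot(q, s) + ss[j], 0.0)) for j, s in enumerate(S)]
        for i, q in enumerate(Q)
    ]


# Toy data (made-up values): two queries, three samples.
S = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
Q = [[0.9, 1.2], [3.0, 4.0]]

direct = distances_direct(Q, S)
expanded = distances_expanded(Q, S)
```

The cross term is the only part that involves both Q and S, and it is exactly the entry of the matrix product of Q with the transpose of S, which is why the expanded form maps well onto a matmul.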
By the way, in the formula above PP = ⟨p, p⟩ and QQ = ⟨q, q⟩ are actually the squared L2 norms of the respective vectors.
If both vectors are L2 normalized, then PP = QQ = 1.
That gives an interesting relation between inner product and Euclidean distance:

$$d(p, q) = \sqrt{2 - 2 \langle p, q \rangle}$$

However, doing L2 normalization discards the information about vector magnitude, which in many cases is undesirable.
Instead, we may notice that as long as the knowledge base does not change, PP, its squared vector norm, also remains constant.
So, instead of recomputing it every time, we can just do it once and then use the precomputed result, further accelerating the distance computation.
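A minimal plain-Python sketch of this precomputation idea follows. The class name `PrecomputedL2Retriever` is hypothetical and this is not the L2Retriever from the guide; it only shows the squared norms of the knowledge base being computed once and reused for every query.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


class PrecomputedL2Retriever:
    """Illustrative retriever that caches the squared norms of the knowledge base."""

    def __init__(self, samples):
        self.samples = [list(s) for s in samples]
        # Computed once, valid for as long as the knowledge base does not change.
        self.sq_norms = [dot(s, s) for s in self.samples]

    def query(self, q, k=1):
        """Return indices of the k samples closest to q."""
        qq = dot(q, q)
        sq_dists = [
            pp - 2.0 * dot(q, s) + qq
            for s, pp in zip(self.samples, self.sq_norms)
        ]
        # Sorting by squared distance gives the same order as sorting by
        # distance, because the square root is monotonic.
        return sorted(range(len(sq_dists)), key=lambda i: sq_dists[i])[:k]


# Toy knowledge base (made-up values).
retriever = PrecomputedL2Retriever([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
closest = retriever.query([0.9, 1.2], k=2)
```

Per query, only the cross terms and one query norm are left to compute. Since the query norm is the same for every sample, it could even be dropped when only the ranking matters.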
Now let us put it all together.

Example: movie recommendation system

For this example we will use a dataset of movie summaries from IMDB.
Using the NLU and Retriever modules we will build a movie recommendation system that will suggest movies with similar plot features.
First, let’s get and prepare the IMDB dataset.
Vectorize movie plots with BERT NLU

Finally, using the L2Retriever, find movies with plot vectors most similar to the query movie, and return them to the user.
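The recommendation step can be sketched in plain Python as follows. This is not the guide's buildMovieRecommender or L2Retriever; the function name, the placeholder titles and the two-dimensional "plot vectors" are invented for illustration, whereas real plot vectors would come from the BERT vectorizer.

```python
import math


def build_movie_recommender_sketch(names, vectors, k=3):
    """Return a function that suggests the k titles nearest to a query title."""
    index = {name: i for i, name in enumerate(names)}

    def recommend(query_name):
        qi = index[query_name]
        q = vectors[qi]
        dists = [
            math.sqrt(sum((a - b) ** 2 for a, b in zip(q, v))) for v in vectors
        ]
        order = sorted(range(len(names)), key=lambda i: dists[i])
        # Skip the query movie itself, which is always at distance zero.
        return [names[i] for i in order if i != qi][:k]

    return recommend


# Placeholder titles and made-up vectors, for illustration only.
toy_names = ["sci-fi A", "sci-fi B", "romance A", "romance B"]
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

recommend_sketch = build_movie_recommender_sketch(toy_names, toy_vectors, k=2)
suggestions = recommend_sketch("sci-fi A")
```

The only difference from a plain nearest neighbour query is that the query is itself an item of the knowledge base, so it has to be excluded from its own results.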
Let’s check it out!

>>> recommend = buildMovieRecommender(names, X_vect)
>>> recommend("The Matrix")
Impostor
Immortel
Saturn 3
Terminator Salvation
The Terminator
Logan's Run
Genesis II
Tron: Legacy
Blade Runner

Even without supervision, the model performs adequately on several classification and retrieval tasks.
While the model performance can be improved with supervised data, the described approach to text feature extraction provides a solid baseline for downstream NLP solutions.
This concludes the guide to building a search engine with BERT and TensorFlow.