Understanding Entity Embeddings and Its Application

Photo: Knowledge, Alfons Morales, Unsplash

Hafidz Zulkifli, Jan 27

As of late I’ve been reading a lot on entity embeddings after being tasked to work on a forecasting problem.

The task at hand was to predict the salary of a given job title, given the historical job ads data that we have in our data warehouse.

Naturally, I just had to seek out how this can be solved using deep learning, since it’s a lot sexier nowadays to do stuff in deep learning instead of plain ’ol linear regression (if you’re reading this, and since only a data scientist would ever be reading this post, I’m sure you’d understand 🙂).

And what better way to go about learning deep learning than to look for examples from fast.ai.


Lo and behold — they actually do have an example of that.

An Introduction to Deep Learning for Tabular Data · fast.ai: “There is a powerful technique that is winning Kaggle competitions and is widely used at Google (according to Jeff…”

Complete with code and all for a very similar problem, the Kaggle Rossmann Challenge.

Guess that wraps up my task then.

Having run through the code and retrofitted it to my data, though, there’s clearly a lot more that one needs to understand to make it run as smoothly as the author had intended.

I was intrigued by the idea that one can represent not only words as vectors, but even time-based data. It was news to me at the time.

I knew then that I had to look further into this notion of representing something as a vector, and what more can be accomplished using it.

Hence this post.

Entity Embedding

What is it?

Loosely speaking, an entity embedding is a vector (a list of real numbers) representation of something (aka an entity).

That something (again, the entity), in Natural Language Processing (NLP) for instance, can be a word, or a sentence, or a paragraph.

“The GloVe word embedding of the word ‘stick’: a vector of 200 floats (rounded to two decimals). It goes on for two hundred values.” [1]

However, it really isn’t limited to just that.

An entity (the thing that the vector represents) can also be thought of as an object, a context or an idea.

In the case of the popular Word2Vec model [8], that thing is a word.

In recent times though, researchers have taken the idea of creating embeddings from context (recall that Word2Vec creates an embedding from the context of a word’s surrounding words) and applied it to other kinds of objects (which I’ll go through later).

For example, in the diagram below, a visualization of Twitter user entity embeddings demonstrates that users with similar characteristics are closer to each other in the given vector space.

From Twitter [2]

Why is it important?

In NLP, being able to represent words as embeddings allows us to use much less memory, as the vector length is typically much shorter than the size of the language’s vocabulary (traditionally, one would use one-hot encoding, which needs a vector as long as the vocabulary to represent each word).
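To make the size difference concrete, here is a minimal sketch in plain Python. The toy vocabulary, the embedding dimension of 3 and the random table values are all assumptions made for illustration; in a real model the table would be learned during training.

```python
import random

def one_hot(index, vocab_size):
    """Return a one-hot vector: all zeros except a 1.0 at `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def make_embedding_table(vocab_size, dim, seed=0):
    """Build a (vocab_size x dim) table of small random floats.

    A trained model would learn these values; random numbers are used
    here only to show the shapes involved.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]

# Hypothetical toy vocabulary; real vocabularies are far larger.
vocab = ["stick", "stone", "king", "queen", "salary", "engineer"]
word_to_index = {word: i for i, word in enumerate(vocab)}

embedding_dim = 3
table = make_embedding_table(len(vocab), embedding_dim)

idx = word_to_index["king"]
sparse_vec = one_hot(idx, len(vocab))  # length grows with the vocabulary
dense_vec = table[idx]                 # length stays fixed at embedding_dim

print(len(sparse_vec), len(dense_vec))  # 6 3
```

With a vocabulary of, say, a hundred thousand words, the one-hot vector would have a hundred thousand entries per word, while the embedding would keep whatever fixed length was chosen.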

In addition, the Word2Vec paper showed that the embeddings also capture some of the semantic meaning of the word.

But more importantly, being able to represent entities (e.g. words, Twitter users, purchasing behaviour over time) as vectors opens up the possibility of performing various operations on them (the typical example being to use them as input to a machine learning model).

Applications

The following are some examples of how some well-known organizations are using it to their advantage.

The list is not meant to be exhaustive or in-depth.

Readers are advised to check out the respective source materials (links provided in the reference section) should more detail be required.

Model features

Embeddings are created first and foremost as a way to represent something in vector format, to be used in a deep learning model.

They’re useful to have since they allow us to capture and forward only the most salient information from the original data into our model.

A model using embeddings as input features will benefit from their encoded knowledge, and therefore improve performance.

On top of that, assuming compactness of the embeddings, the model itself will require fewer parameters, resulting in faster iteration speed and cost savings in terms of infrastructure during both training and serving. [2]

For text-based data, many model architectures have been developed to better capture the different kinds of information that text can contain.

From mere one-hot encoding and TF-IDF, to neural architectures like Word2Vec and GloVe, to the current state of the art like ELMo, ULMFiT and BERT, today’s word embeddings have evolved from merely storing a binary yes-or-no status to capturing syntactic relationships and context.

For other types of information, like Twitter users, Pinterest pins or historical product sales revenue, a whole new kind of model architecture is usually required.

Instacart, in trying to optimize the efficiency of its personal shoppers, built a deep learning model to predict the fastest sequence in which a shopper can pick the items.

For the model, they embed their store locations, shopping items and shoppers into a 10-dimensional vector space by using the Embedding layer in Keras [3].

The store location embedding enables the model to learn store layouts and generalize learnings across retailers and locations.

The shopper embedding learns that shoppers may take consistently different routes through stores.
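As a rough illustration of what an embedding layer does with such categorical inputs, the sketch below uses plain Python lookup tables rather than Keras itself. The ID counts, the random initial values and the concatenation step are assumptions for illustration, not Instacart’s actual model; only the 10-dimensional size comes from the text above.

```python
import random

EMBEDDING_DIM = 10  # the dimension mentioned in the text

def make_table(num_ids, dim, rng):
    """One row of `dim` floats per ID; a trained model would learn these."""
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_ids)]

rng = random.Random(42)

# Hypothetical cardinalities, chosen only for illustration.
tables = {
    "store": make_table(5, EMBEDDING_DIM, rng),
    "item": make_table(100, EMBEDDING_DIM, rng),
    "shopper": make_table(20, EMBEDDING_DIM, rng),
}

def embed(example):
    """Look up each categorical ID and concatenate the resulting vectors."""
    features = []
    for name in ("store", "item", "shopper"):
        features.extend(tables[name][example[name]])
    return features

x = embed({"store": 2, "item": 57, "shopper": 11})
print(len(x))  # 30: three embeddings of 10 dimensions each
```

The concatenated vector is what the rest of the network would consume; during training, gradients flow back into the table rows that were looked up.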

Looking deeper at the embedding layer afterwards reveals some interesting insights.

2-dimensional space representation of the embeddings using t-SNE [3]

Most of these clusters (above) correspond to departments, even though the department data was never used to learn the embeddings.

Further, we can zoom into a region, like the blue meat and seafood department in the upper left.

There are other products that appear near the meat and seafood, but aren’t meat and seafood.

Instead, these are products (like spices, marinades, deli or other items) that are sold at the meat and seafood counter.

The model is learning the organization of the store better than the department and aisle encoding data we have. [3]

A similar technique was also used in the Taxi Destination Prediction Kaggle competition [5], where the authors opted for a neural network approach and eventually won the competition.

Categorical data (i.e. client ID, taxi ID and stand ID) are represented as 10-dimensional embeddings.

Time is broken down into several bucket types, each of which is then embedded with the same dimension count.
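One way to picture the time bucketing is the sketch below. The three bucket types used here (quarter-hour of the day, day of the week, week of the year) are an assumption about how the timestamp could be split; the text above only says that several bucket types were used. Each bucket is a small integer ID, so it can be embedded exactly like a categorical ID.

```python
from datetime import datetime

def time_buckets(ts):
    """Map a timestamp to three small integer IDs that can each be embedded.

    The choice of buckets is an illustrative assumption.
    """
    return {
        "quarter_hour": ts.hour * 4 + ts.minute // 15,  # 0..95
        "day_of_week": ts.weekday(),                    # 0 (Monday)..6 (Sunday)
        "week_of_year": ts.isocalendar()[1] - 1,        # 0..52 (ISO week, zero-based)
    }

# Hypothetical pickup time: Tuesday 1 July 2014, 17:40.
buckets = time_buckets(datetime(2014, 7, 1, 17, 40))
print(buckets["quarter_hour"], buckets["day_of_week"])  # 70 1
```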

Feature compression

In the Taxi Destination Prediction example earlier, we saw that there were 6 items being embedded: the 3 different ID types and 3 time buckets.

If such a model is deployed in production, and we continually add more and more features to improve its performance, there will come a time when processing everything at inference time becomes too slow.

What we could do instead is store the embeddings after the model has been trained, and load them back later (like Word2Vec embeddings).

Note though that one should only store embeddings that are mostly static in nature (like the IDs).

Loading pretrained embeddings in this way would reduce the computation and memory load at inference time.
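A minimal sketch of this store-then-load idea, using only the Python standard library, might look like the following. The taxi IDs, the vector values, the file name and the choice of JSON as a storage format are all hypothetical.

```python
import json
import os
import tempfile

# Hypothetical trained embeddings keyed by taxi ID (values are made up).
trained = {
    "taxi_001": [0.12, -0.40, 0.33],
    "taxi_002": [0.10, -0.35, 0.30],
}

def save_embeddings(embeddings, path):
    """Write the ID-to-vector mapping to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(embeddings, f)

def load_embeddings(path):
    """Read the ID-to-vector mapping back from disk."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "taxi_id_embeddings.json")
    save_embeddings(trained, path)   # done once, after training
    lookup = load_embeddings(path)   # done once, when the service starts

# At inference time an ID becomes a vector via a dictionary lookup.
vector = lookup["taxi_002"]
print(vector)  # [0.1, -0.35, 0.3]
```

For large tables a binary format would likely be a better fit than JSON; the point here is only that static embeddings can be computed once and reused.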

Nearest neighbour search

As we saw earlier in Instacart’s visualized embedding, similar items are in fact closer to each other in the multi-dimensional vector space that they exist in.

Exploiting this behaviour, given a sample vector for something we’re interested in, we can look for items with similar attributes based on their distance from it.
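A brute-force version of such a search can be sketched in a few lines. The item names and 3-dimensional vectors below are made up for illustration, and cosine similarity is just one common choice of closeness measure; a production system would typically use an approximate nearest neighbour index instead of scanning every item.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbours(query, embeddings, k=2):
    """Return the names of the k items most similar to `query`."""
    scored = [(cosine_similarity(query, vec), name)
              for name, vec in embeddings.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical item embeddings, invented for this example.
items = {
    "salmon": [0.9, 0.1, 0.0],
    "steak": [0.8, 0.2, 0.1],
    "marinade": [0.7, 0.3, 0.2],
    "shampoo": [0.0, 0.1, 0.9],
}

query = items["salmon"]
candidates = {name: vec for name, vec in items.items() if name != "salmon"}
result = nearest_neighbours(query, candidates, k=2)
print(result)  # ['steak', 'marinade']
```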

Pinterest, for instance, created 128-dimensional embeddings for its Pins (aka Pin2Vec) to capture the context of how each Pin relates to Pinners [4].

Their creation isn’t as straightforward as in the Instacart or Taxi competition cases, however, as Pinterest adopted a method similar to Word2Vec to come up with the representation.

“The learned Pin2Vec groups Pins with respect to a Pinner’s recent engagement” [4]

Pin2Vec architecture is inspired by Word2Vec [4]

The result was more relevant recommendations compared to the predecessor.

However, they still use the latter for long-tail Pins with sparse data.

In the Pinterest app, retrieving related pins (i.e. search results) isn’t only based on tapping a pin.

One can also make use of the visual search feature, or the text bar at the top of the app.

Can a nearest neighbour based search still be used in such cases?

In his lecture at Berkeley, Dan Gillick of Google proposes that it can be done, provided that we are able to place all of the different objects/entities, whether they come from text, images, video or audio, in the same vector space. [6 (42:05)]

By training all the models together, we can ensure that the embeddings reside in the same space.

Considering the above diagram for example, there are 2 labelled datasets: (i) Question-Image and (ii) Question-Document.

By training on both datasets together we can ensure that the question, image and document all exist in a single vector space.

Transfer learning

Transfer learning is another common use of entity embeddings.

Essentially, it means that we train a model (e.g. a language model) and use it for another type of problem (e.g. text classification).

Training a classifier using BERT [1].

In general, models that are built on a pretrained model are faster to train and can achieve better results.

In my current organization, we leverage BERT for one of our text classification tasks, where it quickly achieved state-of-the-art results for the given problem.

Summary

From merely representing words and their semantics to representing time and spatial locations, there seems to be a clear advantage in being able to come up with good vector representations of entities.

The key phrase there, however, is “being able to”.

While there has been a lot of development in representing words, or text in general, as embeddings, the same cannot be said of other types of entities.

As we’ve seen in the case of Pin2Vec, coming up with the embedding does require some understanding of the problem and creativity in solving it.

And even then, it is important that we don’t assume the learned representations are 100% correct.

Even Word2Vec, for all its hype, isn’t all that reliable in some cases.

In the diagram below, for example, while Obama is to Barack as Sarkozy is to Nicolas, the same can’t be said for Putin and Medvedev, who are two separate individuals.

For more details on the caveats of word embeddings, check out the article in [7].

Word2Vec paper [8]

References (in no particular order)

[1] Jay Alammar, The Illustrated BERT: http://jalammar.[...]io/illustrated-bert/

[2] Embeddings@Twitter: https://blog.[...]html

[3] Deep Learning with Emojis: https://tech.[...]com/deep-learning-with-emojis-not-math-660ba1ad6cdc

[4] Applying deep learning to Related Pins: https://medium.com/the-graph/applying-deep-learning-to-related-pins-a6fee3c92f5e

[5] Artificial Neural Networks Applied to Taxi Destination Prediction: https://arxiv.[...]pdf

[6] Embeddings for Everything: Search in the Neural Network Era: https://www.[...]com/watch?time_continue=36&v=JGHVJXP9NHw

[7] Beyond Word Embedding Part 3: https://towardsdatascience.com/beyond-word-embeddings-part-3-four-common-flaws-in-state-of-the-art-neural-nlp-models-c1d35d3496d0

[8] Efficient Estimation of Word Representations in Vector Space: https://arxiv.[...]pdf

Other References

Beyond Word Embeddings Part 2, Aaron Bornstein: https://towardsdatascience.com/beyond-word-embeddings-part-2-word-vectors-nlp-modeling-from-bow-to-bert-4ebd4711d0ec

Paper by 1st prize winner for the Kaggle Taxi Prediction: https://arxiv.[...]pdf

Paper by 3rd prize winner for the Kaggle Rossmann Challenge: https://arxiv.[...]