How to build a Recommendation Engine quick and simple

With the size of the Zoopla catalogue it is really difficult to generate a significant amount of co-occurrence even at Zoopla’s traffic volumes.

You do not depend on recommendations for the discovery of new products.

Because collaborative filtering requires co-occurrence to generate signals, the algorithm has a big cold start problem.

Any new item in the product catalogue has no co-occurrence and cannot be recommended without some initial engagement of users with the new item.

This could be acceptable if your business uses, e.g., a lot of CRM and marketing as a strategy to promote new products.

Collaborative Filtering Quick and Simple

One option would be to use Spark and the alternating least squares (ALS) algorithm, which is a simple solution for model training but does not provide an immediate solution for deployment and scoring.

I recommend a different approach to get started.

As it turns out, the maths of search and recommendation problems are strikingly similar.

Most importantly, a good user experience in search and a good user experience in recommendations are almost indistinguishable.

Basically, search results are recommendations if we can formulate recommendations as search queries.

It’s an ideal solution as many websites and businesses already operate search engines in their backends and we can leverage existing infrastructure to build our recommendation system.

Elasticsearch scales well and exists as fully managed deployments, e.g. on AWS.

There is no safer bet if you want to deploy your recommendation engine into production fast!

How do you create a recommendation with a search engine?

1. We store all user-item interactions in a search index.
2. When a user is on a page for apples we search for all users who have apples in Elasticsearch. This defines our foreground population.
3. We look for co-occurrence in our foreground which gives us puppies.
4. We search for puppies in the background population.
5. We calculate some kind of score for our puppy recommendation.
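To make the five steps concrete, here is a minimal in-memory sketch in plain Python. The toy `interactions` data, the function names and the simple lift ratio used for scoring are all assumptions for illustration; they are not what Elasticsearch runs internally.

```python
from collections import Counter

# Step 1: hypothetical toy store of user-item interactions (one entry per user).
interactions = {
    "user_1": {"apple", "book", "lemon", "puppy"},
    "user_2": {"apple", "puppy"},
    "user_3": {"apple", "lemon"},
    "user_4": {"book", "lemon"},
    "user_5": {"book"},
    "user_6": {"lemon", "book"},
}


def co_occurrence_counts(interactions, seed):
    """Return (fg_size, bg_size, {item: (fg_count, bg_count)}) for a seed item."""
    # Step 2: the foreground is every user who interacted with the seed item.
    foreground = [items for items in interactions.values() if seed in items]
    # Step 3: count what else the foreground users interacted with.
    fg_counts = Counter(
        item for items in foreground for item in items if item != seed
    )
    # Step 4: count the same items across all users (the background).
    bg_counts = Counter(item for items in interactions.values() for item in items)
    counts = {item: (fg_counts[item], bg_counts[item]) for item in fg_counts}
    return len(foreground), len(interactions), counts


def lift(fg_count, fg_size, bg_count, bg_size):
    """Step 5: an assumed, very simple score (foreground share / background share)."""
    return (fg_count / fg_size) / (bg_count / bg_size)


fg_size, bg_size, counts = co_occurrence_counts(interactions, "apple")
ranked = sorted(
    ((item, lift(fg, fg_size, bg, bg_size)) for item, (fg, bg) in counts.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

On this toy data puppy ranks first for apple, because every puppy owner in the index also has apples.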

The good news: Elasticsearch implements all 5 steps for us in a single query!

If we store our user-item interactions in Elasticsearch as follows

    {
      "_id": "07700f84163df9ee23a4827fd847896c",
      "user": "user_1",
      "products": ["apple", "book", "lemon", "puppy"]
    }

with a document mapping like this:

    {
      "user": {"type": "keyword"},
      "products": {"type": "keyword"}
    }

then all that’s needed to produce some recommendations is the following query, e.g. using Python:

    from elasticsearch import Elasticsearch, RequestsHttpConnection
    from aws_requests_auth.boto_utils import BotoAWSRequestsAuth

    # host, port, scheme, region, index and doc_type are placeholders
    # for the values of your own deployment.
    es = Elasticsearch(
        host=host,
        port=port,
        connection_class=RequestsHttpConnection,
        # The auth helper needs the endpoint, region and service to sign requests.
        http_auth=BotoAWSRequestsAuth(
            aws_host=host, aws_region=region, aws_service="es"
        ),
        scheme=scheme,
    )

    es.search(
        index=index,
        doc_type=doc_type,
        body={
            # Foreground: all users who have "apple" in their products.
            "query": {"bool": {"must": {"term": {"products": "apple"}}}},
            # Products that are unusually frequent in the foreground.
            "aggs": {
                "recommendations": {
                    "significant_terms": {
                        "field": "products",
                        "exclude": "apple",
                        "min_doc_count": 100,
                    }
                }
            },
        },
    )

Elasticsearch will return the recommendations with a JLH score by default, but a range of other scores is available (see the significant terms aggregation documentation).

    {
      ...
      "aggregations": {
        "recommendations": {
          "doc_count": 12200,
          "bg_count": 130000,
          "buckets": [
            {
              "key": "puppy",
              "doc_count": 250,
              "score": 0.15,
              "bg_count": 320
            }
          ]
        }
      }
    }

In our example, the search for apple returned a foreground population of 12,200 users against a background population of 130,000 users, which by default is every user in the index.

Puppy co-occurred 250 times in the foreground and 320 times in the background.

The JLH score is a simple magnitude of change between the background collection and the local search results, given by (fg_percentage - bg_percentage) * (fg_percentage / bg_percentage), which gives a score of 0.15.

As the JLH score is a magnitude change it’s important to remember that fleas jump higher than elephants and the JLH score is very volatile for small data sets.

You can adjust the min_doc_count parameter in the query to quality assure your recommendation results.
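The arithmetic behind the score can be checked with a few lines of Python. This is a sketch of the formula quoted above, not Elasticsearch's own code; the function name `jlh_score`, the guard for non-positive changes and the small "flea" counts are assumptions for illustration.

```python
def jlh_score(fg_count, fg_size, bg_count, bg_size):
    """(fg_percentage - bg_percentage) * (fg_percentage / bg_percentage)."""
    fg_percentage = fg_count / fg_size
    bg_percentage = bg_count / bg_size
    # Assumed guard: terms that are not more frequent in the foreground score zero.
    if bg_percentage == 0 or fg_percentage <= bg_percentage:
        return 0.0
    return (fg_percentage - bg_percentage) * (fg_percentage / bg_percentage)


# The puppy bucket from the example response.
puppy = jlh_score(fg_count=250, fg_size=12200, bg_count=320, bg_size=130000)
print(round(puppy, 2))

# A hypothetical "flea": only 2 hits in a foreground of 10 users, 3 hits overall.
flea = jlh_score(fg_count=2, fg_size=10, bg_count=3, bg_size=130000)
print(flea > puppy)
```

The flea outscores the puppy by orders of magnitude on the strength of two documents, which is exactly the volatility that min_doc_count is there to filter out.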

This is it, a simple but powerful first iteration of a recommendation engine which can be live within a week or less!

Importantly, any version 1 shouldn’t be more complex than this.

Time to production is much more important in the early stages.

Commonly, your first recommendation engine needs a few iterations to optimise the UI and UX rather than the maths.

Next Steps

Elasticsearch is not just a very powerful backend for recommendations, it is also highly flexible!

There are many options to improve our recommendation system while keeping the Elasticsearch backend.

Win!

Step 1:

We can use more sophisticated algorithms such as ALS to create the indicators for our recommendations and we put these indicators into Elasticsearch.

This simplifies the recommendation scoring to a simple look-up as we do the heavy lifting in the training phase, e.g. using Spark.

This way elasticsearch is just a performant presentation layer of our ahead-of-time computed indicators for relevant recommendations.

You can add this easily to an existing product catalogue as new metadata.
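A minimal sketch of what this could look like, assuming a training job has already produced scored indicators per product. The `trained_indicators` data, the `indicators` field name and both helper functions are hypothetical and only illustrate the look-up idea.

```python
# Hypothetical output of an offline training job:
# product -> items whose presence in a user's history indicates interest in it.
trained_indicators = {
    "puppy": [("apple", 0.91), ("pony", 0.77), ("book", 0.05)],
    "lemon": [("apple", 0.40)],
}


def indicator_documents(trained, top_n=2, min_score=0.1):
    """Turn trained indicators into one metadata document per product."""
    docs = []
    for product, scored in trained.items():
        best = sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]
        docs.append(
            {
                "_id": product,
                "indicators": [item for item, score in best if score >= min_score],
            }
        )
    return docs


def lookup_query(current_product):
    """Scoring becomes a look-up: find products indicated by the current one."""
    return {"query": {"term": {"indicators": current_product}}}


docs = indicator_documents(trained_indicators)
query = lookup_query("apple")

# Local stand-in for the search engine, to show which documents would match.
matches = [doc["_id"] for doc in docs if "apple" in doc["indicators"]]
print(matches)
```

All the modelling effort sits in whatever produces `trained_indicators`; the query itself stays a plain term match.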

Step 2:

At the moment we use a binary flag in the products array, which means each product in the products array contributes to the JLH score equally.

Without many changes we could use some metric to score the product occurrence itself to capture a richer signal.

We could use a click count.

Or even better, we could use a click score by normalising a click by the average expected click-through rate (CTR) of the page location generating the click.

E.g. in a list of search results we can calculate an expected CTR for items in first position, second position etc.

We can then calculate the JLH magnitude change from the sum of item metric scores instead of their simple counts.
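As a rough sketch of this idea: the expected CTR values below are made up, and dividing an item's summed score by the total score of its population is only one possible normalisation, chosen here as an assumption.

```python
from collections import defaultdict

# Hypothetical expected click-through rate per position in a result list.
expected_ctr = {1: 0.30, 2: 0.15, 3: 0.10}


def click_score(position):
    """A click on a rarely clicked position counts for more than one on position 1."""
    return 1.0 / expected_ctr[position]


def item_scores(clicks):
    """Sum normalised click scores per item from (item, position) click events."""
    totals = defaultdict(float)
    for item, position in clicks:
        totals[item] += click_score(position)
    return dict(totals)


def weighted_jlh(item, fg_scores, bg_scores):
    """JLH-style magnitude change on score shares instead of document counts."""
    fg_share = fg_scores.get(item, 0.0) / sum(fg_scores.values())
    bg_share = bg_scores[item] / sum(bg_scores.values())
    if fg_share <= bg_share:
        return 0.0
    return (fg_share - bg_share) * (fg_share / bg_share)


foreground = item_scores([("puppy", 1), ("puppy", 3), ("lemon", 2)])
background = {"puppy": 20.0, "lemon": 80.0}  # assumed pre-aggregated scores
print(weighted_jlh("puppy", foreground, background))
```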

Step 3:

Users usually generate a series of events relevant for recommendations, e.g. clicking on multiple items or adding multiple products to a basket.

It’s worth adding a user interaction cache to your recommendation engine (1) to create more complex search queries using a series of events and (2) to capture the delta between the batch ETL process which updates your Elasticsearch index and the user interactions which occurred since the last refresh of your recommendation engine.

Using the event sequence to produce recommendations can help with (1) creating more relevant results and (2) increasing the foreground population to generate a bigger number of co-occurrences in case of low traffic volumes or very big catalogues.

It only needs a minor change to the Elasticsearch query to switch to a should query:

    es.search(
        index=index,
        doc_type=doc_type,
        body={
            # Foreground: users who have at least one of the recent items.
            "query": {
                "bool": {
                    "should": [
                        {"term": {"products": "apple"}},
                        {"term": {"products": "pony"}},
                    ],
                    "minimum_should_match": 1,
                }
            },
            "aggs": {
                "recommendations": {
                    "significant_terms": {
                        "field": "products",
                        "exclude": ["apple", "pony"],
                        "min_doc_count": 10,
                    }
                }
            },
        },
    )

The minimum_should_match parameter allows you to optimise between increasing the foreground population size and making the results more relevant by matching users with increasing similarity.
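With an interaction cache in place, the query body can be generated from whatever the user did recently. A small sketch, assuming the cache hands us a list of product keys; the helper name `build_should_query` and its defaults are hypothetical.

```python
def build_should_query(recent_items, minimum_should_match=1, min_doc_count=10):
    """Build a should-query body from a user's recent items."""
    if not recent_items:
        raise ValueError("need at least one recent item")
    # De-duplicate while keeping the order of the events.
    items = list(dict.fromkeys(recent_items))
    return {
        "query": {
            "bool": {
                "should": [{"term": {"products": item}} for item in items],
                # Never ask for more matches than there are clauses.
                "minimum_should_match": min(minimum_should_match, len(items)),
            }
        },
        "aggs": {
            "recommendations": {
                "significant_terms": {
                    "field": "products",
                    # Do not recommend what the user has already interacted with.
                    "exclude": items,
                    "min_doc_count": min_doc_count,
                }
            }
        },
    }


body = build_should_query(["apple", "pony", "apple"])
print(body["query"]["bool"]["should"])
```

The resulting `body` can be passed to `es.search` in place of the hand-written one; raising `minimum_should_match` trades foreground size for closer matches.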

Step 4:

Currently, our search is a precise lookup of items.

This has some consequences: everything we learn from user interactions in terms of co-occurrence is bound to their specific items.

When an item is taken off the product catalogue we lose everything we learned from it.

We also cannot generalise anything to similar items, e.g. red apples and green apples are distinct items and co-occurrence is limited to precise matches of red apples or green apples.

To overcome this we need to describe items mathematically to compute a similarity between items.

This is called an embedding.

Read my previous blog post where I create a geographic area embedding.

Other options to create embeddings are auto-encoders or the matrix factorisation in the user-item model as described above.

Once we have turned a simple product_id into an embedding we can use probabilistic or fuzzy search to find our foreground population and/or co-occurrences.
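One way to picture this is to expand an item into its nearest neighbours in embedding space before searching. The three-dimensional vectors, the 0.9 threshold and the helper names below are assumptions for illustration; real embeddings would come from a trained model.

```python
import math

# Hypothetical item embeddings.
embeddings = {
    "red_apple": [0.9, 0.1, 0.0],
    "green_apple": [0.8, 0.2, 0.1],
    "puppy": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_items(item, threshold=0.9):
    """Items whose embedding is close to the given item's, the item itself included."""
    target = embeddings[item]
    return sorted(
        other
        for other, vector in embeddings.items()
        if cosine_similarity(target, vector) >= threshold
    )


def fuzzy_foreground_query(item, threshold=0.9):
    """Foreground query that matches the item or anything similar to it."""
    return {
        "bool": {
            "should": [
                {"term": {"products": other}}
                for other in similar_items(item, threshold)
            ],
            "minimum_should_match": 1,
        }
    }


print(similar_items("red_apple"))
```

With this expansion, users with green apples also count towards the foreground of a red apple page.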

This should get you started with recommendations.

It also gives you ample opportunity to build on your first iteration as you learn from production feedback.

The early steps into recommendations stand or fall much more often by UI and UX rather than the simplicity of the maths.

Beyond Elasticsearch

Usually, products have a wealth of metadata we should use, e.g. price, descriptions, images, review ratings, seasonality, tags and categories.

After we turn a rich metadata set into an embedding describing our products we can train a Neural Network to map the input embeddings into a recommendations embedding which has (1) lower dimensionality and (2) a desired behaviour of the cosine similarity of suitable recommendations being high.

One great solution for this is a Siamese Neural Network.

The input is a high dimensional vector of concatenated embeddings of a product’s metadata.

The output of the Neural Network is a much more compact recommendation embedding vector.

The error function is given by the cosine similarity of the output vectors.

We can use the collaborative filtering data to create our supervised learning labels for combinations which should be similar or not.

Importantly, in siamese neural networks the weights of both networks are always identical which gives them their name.

Such a recommendation engine would have no more cold start issue!

Finally, producing a recommendation can be done with a k-nearest-neighbour search of the output recommendation embeddings.
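The moving parts can be sketched in plain Python without any training loop. Everything here is an assumption for illustration: a single shared linear layer stands in for the network, the weights are hand-picked rather than learned, and the loss is one common cosine-based choice rather than a prescribed one.

```python
import math


def embed(weights, features):
    """Shared linear layer: both items of a pair go through the same weights."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pair_loss(weights, features_a, features_b, label):
    """label is 1 for pairs that should be similar and 0 for pairs that should not."""
    similarity = cosine_similarity(
        embed(weights, features_a), embed(weights, features_b)
    )
    return 1.0 - similarity if label == 1 else max(0.0, similarity)


def nearest_neighbours(weights, query_features, catalogue, k=2):
    """k-nearest-neighbour search over the output embeddings."""
    query = embed(weights, query_features)
    scored = [
        (cosine_similarity(query, embed(weights, features)), name)
        for name, features in catalogue.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


# Hypothetical 4-dimensional metadata vectors mapped to 2 dimensions.
weights = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]
catalogue = {
    "red_apple": [1.0, 0.0, 1.0, 0.0],
    "green_apple": [1.0, 0.0, 0.8, 0.1],
    "puppy": [0.0, 1.0, 0.0, 1.0],
}
print(nearest_neighbours(weights, catalogue["red_apple"], catalogue))
```

A brand-new product only needs its metadata vector to get an embedding and neighbours, which is why the cold start problem goes away.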

More on this in a later blog post.



com/in/janteichmann/

I’m a highly skilled data scientist, data engineer and solution architect.

I hold a PhD in Mathematics from City University London and offer a strong background in machine learning, statistical modelling and programming.

I have extensive experience in big data, full stack development and interactive data visualisations which helps me to deliver engaging and comprehensive data science products.

I previously co-founded Cambridge Energy Data Lab where we celebrated a successful exit with Enechange, a utility comparison platform, which is now the market leader in Japan.

I now use my skills as the Senior Data Scientist at Zoopla to lead the data science team.

We build models on Zoopla’s vast amounts of property market data, behavioural data, geo data, property images and text data sets.

