Having previously mapped our descriptors using our wine wheels, we have already somewhat standardized the wine terms in our corpus.
This was done to eliminate unnecessary semantic nuance (e.g. consolidate ‘wet stone’, ‘wet slate’ and ‘wet cement’ to ‘wet rock’), hopefully enhancing the quality of our Word2Vec model.
Our trained Word2Vec model consists of a 300-dimensional embedding for every term in our corpus.
However, we can recall from the previous step in this analysis that we only really care about the terms that are relevant descriptors of a wine’s sensory experience.
For our Pinot Noir, these were:

dry, flower, sagebrush, elegant, tarragon, pepper, tangy, cranberry, light_bodied

In the adjacent image, we can see the word embedding for each of these mapped descriptors.

Step 5: Weight each word embedding in the wine review with a TF-IDF weighting, and sum the word embeddings together

Now that we have a word embedding for each mapped descriptor, we need to think about how we can combine these into a single vector.
Looking at our Pinot Noir example, ‘dry’ is a fairly common descriptor across all wine reviews.
We want to weight that less than a rarer, more distinctive descriptor such as ‘sagebrush’.
In addition, we want to take into consideration the total number of descriptors per review.
If there are 20 descriptors in one review and five in another, each individual descriptor in the former review probably contributes less to the overall profile of the wine than in the latter.
Term Frequency-Inverse Document Frequency (TF-IDF) takes both of these factors into consideration.
TF-IDF looks at how many mapped descriptors are contained within a single review (TF), as well as at how often each mapped descriptor appears in the 180,000 wine reviews (IDF).
Multiplying each mapped descriptor vector by its TF-IDF weighting gives us our set of weighted mapped descriptor vectors.
We can then sum these to obtain a single wine embedding for each wine review.
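To make the weighting concrete, here is a minimal sketch of how a TF-IDF-weighted sum could be computed. It is not the code used for this analysis: the function names, the toy three-dimensional vectors and the toy reviews are all assumptions made for illustration, and the plain logarithmic IDF is only one of several common variants.

```python
import math
from collections import Counter


def idf_weights(reviews):
    """Inverse document frequency of each descriptor across all reviews."""
    n_reviews = len(reviews)
    doc_freq = Counter()
    for descriptors in reviews:
        doc_freq.update(set(descriptors))  # count each descriptor once per review
    return {term: math.log(n_reviews / df) for term, df in doc_freq.items()}


def wine_embedding(descriptors, embeddings, idf):
    """TF-IDF-weighted sum of the descriptor vectors of a single review."""
    counts = Counter(d for d in descriptors if d in embeddings and d in idf)
    total = sum(counts.values())
    dim = len(next(iter(embeddings.values())))
    vector = [0.0] * dim
    for term, count in counts.items():
        weight = (count / total) * idf[term]  # TF (share of the review) * IDF (rarity)
        for i, value in enumerate(embeddings[term]):
            vector[i] += weight * value
    return vector


# Toy data (hypothetical): 3-dimensional vectors stand in for the 300-dimensional ones.
toy_embeddings = {
    "dry": [1.0, 0.0, 0.0],
    "sagebrush": [0.0, 1.0, 0.0],
    "cranberry": [0.0, 0.0, 1.0],
}
toy_reviews = [
    ["dry", "sagebrush", "cranberry"],
    ["dry", "cranberry"],
    ["dry"],
    ["dry"],
]
toy_idf = idf_weights(toy_reviews)
pinot_vector = wine_embedding(toy_reviews[0], toy_embeddings, toy_idf)
```

In this toy corpus ‘dry’ appears in every review, so the plain logarithm gives it a weight of zero, while ‘sagebrush’ (one review out of four) is weighted more heavily than ‘cranberry’ (two out of four). A smoothed IDF would keep a small non-zero weight for very common descriptors.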
For our Pinot Noir, this looks something like:

Building a Wine Recommender

Now that we have our wine embeddings, it’s time to have some fun.
One of the things we can do is produce a wine recommender system.
We can do this by using a nearest neighbors model, which calculates the cosine distance between various wine review vectors.
The wine embeddings that lie closest to one another are returned as suggestions.
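As a rough illustration of the idea, the sketch below ranks wines by cosine distance to a query vector using a brute-force search. The wine names and two-dimensional vectors are made up, and the helper functions are hypothetical; a library nearest-neighbors implementation would normally replace the manual loop.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: a zero vector has no measurable similarity
    return 1.0 - dot / (norm_a * norm_b)


def nearest_wines(query, wine_vectors, k=3, exclude=None):
    """Return the names of the k wines whose embeddings lie closest to the query."""
    candidates = [
        (cosine_distance(query, vector), name)
        for name, vector in wine_vectors.items()
        if name != exclude
    ]
    return [name for _, name in sorted(candidates)[:k]]


# Toy wine embeddings (hypothetical names and values).
toy_wines = {
    "wine_a": [1.0, 0.1],
    "wine_b": [0.9, 0.2],
    "wine_c": [0.0, 1.0],
    "wine_d": [-1.0, 0.0],
}
suggestions = nearest_wines(toy_wines["wine_a"], toy_wines, k=2, exclude="wine_a")
```

Excluding the query wine itself keeps it from being returned as its own best match. Because cosine distance ignores vector length, two reviews with similar descriptor profiles are matched even if one of them is much longer than the other.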
Let’s take a look at what we get as suggestions when we insert our Point & Line Pinot Noir from earlier.
Which of the 180,000 possible wines in our dataset are returned as suggestions?

Wine to match: Point & Line 2016 John Sebastiano Vineyard Reserve Pinot Noir (Sta. Rita Hills)
Descriptors: [dry, flower, sagebrush, elegant, tarragon, pepper, tangy, cranberry, light_bodied]
________________________________________________________________
Suggestion 1: Chanin 2014 Bien Nacido Vineyard Pinot Noir (Santa Maria Valley)
Descriptors: [hibiscus, light_bodied, cranberry, dry, rose, white_pepper, light_bodied, pepper, underripe, raspberry, fresh, thyme, oregano, light_bodied, fresh]
Suggestion 2: Hug 2016 Steiner Creek Pinot Noir (San Luis Obispo County)
Descriptors: [fresh, raspberry, thyme, pepper, rosemary, sagebrush, dry, sage, mint, forest_floor, light_bodied, cranberry_pomegranate, tangy]
Suggestion 3: Comartin 2014 Pinot Noir (Santa Cruz Mountains)
Descriptors: [vibrant, tangy, cranberry, hibiscus, strawberry, pepper, brown_spice, pepper, spice, bay_leaf, thyme, herb, underripe, raspberry, cranberry, fruit]

The top three wines returned are all Pinot Noirs from California.
Looking at the descriptors for these wines, we can see that they are indeed very similar to our original wine.
Cranberry features in every one of the suggestions.
Because of the way the wine embeddings have been constructed, the semantic similarity of non-identical terms is also taken into consideration.
For example, the word ‘flower’ in the original wine review is similar to ‘hibiscus’ and ‘rose’ in the first suggestion.
If we look at the top ten wine suggestions for our Point & Line Pinot Noir (see this Jupyter Notebook for the full list), we can see that the recommendations are remarkably consistent.
All ten wines come from California, and nine out of the ten are Pinot Noirs.
Five are even produced within a 60-mile radius of our original wine.
The only wine that is not a Pinot Noir is a Cabernet Franc from the Santa Ynez Valley, a mere 25-minute drive from where our Point & Line Pinot is produced.
The geographical origin of our Pinot Noir appears to have a very strong effect on its sensory profile, allowing it to be matched with similar wines from its direct vicinity.
The adjacent map illustrates just how geographically concentrated our wine recommendations are.
The remarkable performance of this recommender model does raise the question: how is it possible that the suggestions returned are so specific to a single geographical area?

At its core, this analysis is entirely dependent on the wine reviews used to construct the wine embeddings.
In this post, a taster for the Wine Enthusiast explains how wines are rated on the www.
Although ratings are given through a process of blind tasting, it is not entirely clear whether the text description in the review is also a product of an unbiased evaluation process.
It is possible that reviewers, having seen the bottle, consciously or unconsciously attribute certain terms to specific types of wine (e.g. ‘sagebrush’ for Pinot Noirs from Southern California).
On the other hand, it is also entirely possible that these wines truly exhibit sensory profiles that can be attributed to specific grape varieties, terroirs and wine-making styles.
The professional reviewers from the Wine Enthusiast may well have such finely-tuned palates that they can pick out these nuances in each wine, without having seen the bottle.
Using Descriptors to Suggest Wines

As a final exercise, we can take a slightly different approach to leveraging our wine recommender.
Let’s say that we are looking for a wine with specific characteristics.
On a hot summer’s day, we might feel like a wine that is fresh, high in acid, and has aromas of grapefruit, grass and lime.
Taking the RoboSomm wine wheels for a spin, we can pick the descriptors that match these characteristics: ‘fresh’, ‘high_acid’, ‘grapefruit’, ‘grass’ and ‘lime’.
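One way such a list of descriptors might be turned into a query vector is sketched below. It assumes that each chosen descriptor is weighted by its IDF and by an equal share of the query, mirroring the weighting used for the reviews; the function name and the toy values are hypothetical, and the source does not confirm that queries were weighted this way. The resulting vector can then be compared against the wine embeddings by cosine distance.

```python
def descriptor_query_vector(descriptors, embeddings, idf):
    """Build a query vector from hand-picked descriptors.

    Returns the vector together with any descriptors that had to be skipped
    because no embedding or IDF weight is available for them.
    """
    known = [d for d in descriptors if d in embeddings and d in idf]
    unknown = [d for d in descriptors if d not in known]
    if not known:
        raise ValueError("none of the descriptors are in the vocabulary")
    dim = len(embeddings[known[0]])
    vector = [0.0] * dim
    for term in known:
        weight = idf[term] / len(known)  # equal share of the query * rarity of the term
        for i, value in enumerate(embeddings[term]):
            vector[i] += weight * value
    return vector, unknown


# Toy values (hypothetical): 2-dimensional embeddings and made-up IDF weights.
toy_embeddings = {
    "fresh": [1.0, 0.0],
    "grapefruit": [0.0, 1.0],
    "lime": [0.0, 2.0],
}
toy_idf = {"fresh": 0.5, "grapefruit": 2.0, "lime": 1.0}

query_vector, skipped = descriptor_query_vector(
    ["fresh", "grapefruit", "lime", "grass"], toy_embeddings, toy_idf
)
```

Returning the skipped descriptors makes it visible when a requested term (here ‘grass’, which is missing from the toy vocabulary) did not influence the query at all.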
Feeding these descriptors into the wine recommender, we get the following suggestions:

Suggestion 1: Undurraga 2011 Sibaris Reserva Especial Sauvignon Blanc (Leyda Valley)
Descriptors: [minerality, zesty, crisp, grass, lime, grapefruit, lemongrass, angular]
Suggestion 2: Santa Rita 2012 Reserva Sauvignon Blanc (Casablanca Valley)
Descriptors: [snappy, pungent, gooseberry, grapefruit, lime, racy, lime, grapefruit, nettle, pith, bitter]
Suggestion 3: Luis Felipe Edwards 2015 Marea Sauvignon Blanc (Leyda Valley)
Descriptors: [punchy, grass, citrus, tropical_fruit, fruit, angular, fresh, minerality, tangerine, lime, lemon, grapefruit, tangy]

All three of the wine recommendations are Chilean Sauvignon Blancs, with two coming from the Leyda Valley.
Once again, it is noteworthy how geographically concentrated the suggestions are, especially considering that the wine recommender has 180,000 different wines to choose from!

Conclusion

There is no shortage of ways in which we can use our wine embeddings.
Our simple wine recommender model suggests that it may be worth further investigating wine styles through the lens of geography.
What is the influence of terroir vs. wine-making style? Do geographical differences establish themselves in the same ways for different grape varieties? Perhaps we can also learn more about the process by which wine reviews are written and the extent to which biases drive the use of certain descriptors.