In addition to removing any ‘unknowable’ information, we have removed any terms that are non-informative, and have gone through a couple of text normalization steps to standardize our text.
From the description above, we can obtain the following descriptors: ['hard tannin', 'plum', 'pepperi', 'mild', 'tannin', 'plum flavor']

Starting with Grape Varieties

The world’s best sommeliers can derive the variety, region and vintage of a wine when they are blind tasting.
This may be a little too ambitious for the first generation of RoboSomm.
To limit the scope of the exercise, we will start by seeing if we can train a model to predict the most common grape varieties in our dataset.
For simplicity, we will also remove blends.
The grape varieties we have left are: ['Pinot Noir', 'Chardonnay', 'Cabernet Sauvignon', 'Riesling', 'Syrah', 'Sauvignon Blanc', 'Zinfandel', 'Sangiovese', 'Merlot', 'Malbec', 'Tempranillo', 'Nebbiolo', 'Pinot Gris', 'Grüner Veltliner', 'Cabernet Franc', 'Grenache', 'Viognier', 'Gamay', 'Gewürztraminer', 'Barbera']

To familiarize ourselves with these grape varieties, we can build word clouds based on the descriptors that are most frequently associated with each. When a word appears more frequently, it will be larger in our word clouds.
We can see that many of the descriptors appear for almost every grape variety — fruit and tannins, for instance.
Since ‘tannin’ appears more prominently for Syrah than for Grenache, for instance, we might argue that, generally speaking, Syrah wines are more tannic than Grenache wines.
Despite some words appearing in several word clouds, we can see that there are notable differences between the grape varieties.
Sauvignon Blanc has more significant citrus notes than other wines.
Barbera wines are comparatively heavy on spice.
For Chardonnay, toast and apple come to mind.
Building Models

With our raw text converted to descriptors, we are almost ready to start building models.
First, we need to convert the text-based descriptors into numerical features.
We will use a simple method called one-hot encoding to produce a matrix of 1’s and 0’s indicating whether each descriptor is present for a given observation.
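One way to do this, assuming scikit-learn, is `MultiLabelBinarizer`, which maps each wine’s descriptor list to a binary row (the descriptor lists below are illustrative):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative descriptor lists, one per wine
wine_descriptors = [
    ["hard tannin", "plum", "mild"],
    ["citrus", "mild"],
]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(wine_descriptors)

# Columns are the sorted descriptors: citrus, hard tannin, mild, plum
# X is then [[0, 1, 1, 1],
#            [1, 0, 1, 0]]
```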
In addition, we should take a moment to consider how we will evaluate the performance of our models.
We would like RoboSomm to identify a wine in front of it with high precision.
When it makes a prediction, that prediction needs to be accurate.
We are less interested in retrieving all examples of a specific grape variety.
In addition, it would be nice if our model is able to identify many different grape varieties with high precision — not just the most common ones in our dataset.
As such, the key performance metric we will look to is the average precision across all categories (not weighted by the number of observations).
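Concretely, this unweighted average is the macro-averaged precision; a toy sketch with scikit-learn (the labels and predictions here are illustrative):

```python
from sklearn.metrics import precision_score

# Toy predictions: Pinot Noir is common, Viognier is rare
y_true = ["Pinot Noir", "Pinot Noir", "Pinot Noir", "Viognier"]
y_pred = ["Pinot Noir", "Pinot Noir", "Viognier", "Viognier"]

# average="macro": each variety's precision counts equally,
# regardless of how many observations it has
macro_precision = precision_score(y_true, y_pred, average="macro")
# Pinot Noir precision = 2/2, Viognier precision = 1/2 -> macro = 0.75
```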
The final piece of housekeeping we should address before we get into our models is how we have defined a training and test set — the training set is used to train our model, and the test set is used to evaluate its performance.
Our training set is 80% of our dataset, and the test set constitutes the remaining 20% of the data.
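An 80/20 split like this can be sketched with scikit-learn’s `train_test_split`; the features and labels below are stand-ins for the real one-hot matrix:

```python
from sklearn.model_selection import train_test_split

# Stand-in features and varieties; the real X is the descriptor matrix
X = [[i] for i in range(100)]
y = ["Pinot Noir"] * 60 + ["Chardonnay"] * 40

# 80/20 split; stratify keeps variety proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```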
Model 1: Decision Tree

When sommeliers go through a formal blind tasting, they use a structure called ‘the grid’.
The grid is a memorized table of attributes.
In the world of data science, we could think of this as being similar to a decision tree.
Rounds of successive questions provide a neat structure that helps somms classify the wine in front of them.
Our decision tree model only achieves a low average precision on our test set (28%).
However, the decision tree can show us which trail of successive questions most effectively allows us to distinguish between grape varieties.
Let us visualize the first few layers of the tree (the actual decision tree is much deeper) to see what we can learn.
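A sketch of how such a tree can be trained and its top layers printed, assuming scikit-learn and a tiny illustrative one-hot matrix (the real model is trained on the full descriptor matrix):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["bell pepper", "lychee", "cherry"]  # illustrative descriptors

# Each row records whether a wine's description contains each descriptor
X = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
y = ["Sauvignon Blanc", "Sauvignon Blanc", "Gewürztraminer", "Pinot Noir"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Show only the first few layers, as in the article's visualization
print(export_text(clf, feature_names=feature_names, max_depth=2))
```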
The visualization exercise reveals that some wines are highly distinctive.
Sauvignon Blanc, for instance, appears to be a very safe bet if there is a bell pepper taste/aroma and an absence of lychee.
Despite the higher number of red vs. white grape varieties in our dataset (13 vs. 7), we can see that some Nebbiolo (balsamic) and Gamay (banana) wines have fairly unique flavor profiles.
There is no guarantee that all Sauvignon Blanc, Nebbiolo and Gamay wines have the characteristics listed above, but when these characteristics are present, they are a good indicator of what prediction to make.
Model 2: Voting Classifier

A decision tree is a fairly simple classification algorithm. More sophisticated algorithms may yield higher performance.
For this exercise, we have explored multiple options, but found that Multinomial Naive Bayes (MNB) and Random Forests Classifiers are the algorithms that yield the highest average precision for our test set.
To benefit from the relative strengths of these algorithms, we can go a step further and combine the predictions returned by each using something called a Voting Classifier.
The Voting Classifier runs both algorithms and weights the predictions returned by each to arrive at a single prediction.
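A sketch of such an ensemble with scikit-learn, assuming soft (probability-weighted) voting; the data here is a toy stand-in for the descriptor matrix:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy one-hot rows; feature 0 distinguishes the two varieties
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]] * 5
y = ["Syrah", "Syrah", "Gamay", "Gamay"] * 5

voter = VotingClassifier(
    estimators=[
        ("mnb", MultinomialNB()),
        ("rf", RandomForestClassifier(random_state=0)),
    ],
    voting="soft",  # average the predicted class probabilities
)
voter.fit(X, y)
```

A `weights` argument can also be passed to `VotingClassifier` to favour whichever estimator performs better on its own.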
Our Voting Classifier succeeds in bringing the unweighted average precision to 51%.
That seems a lot more… palatable.
Now that we have a classifier with better performance, let us take a deeper look at the predictions that are being returned.
Are there specific grape varieties that are just harder to predict than others? It appears so: some varieties can indeed be predicted with high precision, while others cannot.
Gewürztraminer, Grüner Veltliner, Pinot Gris and Sauvignon Blanc (conveniently all white wines) can be predicted with levels of precision around 75%.
This is likely also a product of the fact that we only have 7 white grape varieties in our dataset.
Among the reds, we are getting relatively good precision for Nebbiolo and Zinfandel.
Despite our encouraging results, some grape varieties seem to be very hard to predict.
Cabernet Franc and Viognier are never returned as predictions for our test set.
Let us dive a little deeper into this.
Why might this be the case?

To study the performance of our model in more detail, we will build a confusion matrix.
A confusion matrix is a grid that shows the number of correct and incorrect predictions, summarized with count values.
Our confusion matrix can help us tell stories about the predictions that are being returned.
For instance, we can see that 1,232 Pinot Noir wines in our test set were predicted correctly.
We can also see that Cabernet Franc wines are consistently predicted as either Pinot Noir, Cabernet Sauvignon, or sometimes Syrah.
Viognier wines are consistently predicted as being Chardonnay or Riesling.
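A sketch of how such a matrix is produced, assuming scikit-learn; the predictions below are illustrative, not the article’s actual counts:

```python
from sklearn.metrics import confusion_matrix

labels = ["Pinot Noir", "Cabernet Franc", "Viognier", "Chardonnay"]

# Illustrative test-set labels and predictions
y_true = ["Pinot Noir", "Pinot Noir", "Cabernet Franc", "Viognier"]
y_pred = ["Pinot Noir", "Pinot Noir", "Pinot Noir", "Chardonnay"]

# Rows are true varieties, columns are predicted varieties
cm = confusion_matrix(y_true, y_pred, labels=labels)
```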
We can identify two possible reasons (among many!) why this might be the case:

1. We are dealing with an imbalanced dataset: there are many more Chardonnays than Viogniers, and many more Pinot Noirs than Cabernet Francs. This makes it more likely that common grape varieties are returned as predictions, rather than grape varieties that are less numerous. Although techniques to mitigate this were explored (undersampling, oversampling), these significantly decreased the overall performance of the model.

2. Viognier and Cabernet Franc are less distinctive grape varieties within our dataset than other, equally numerous grape varieties such as Gamay and Gewürztraminer.
A Final Note on Wine Similarities

In an attempt to build a more comprehensive model, an alternative approach to engineering features was also explored.
Instead of having a matrix of 0’s and 1’s indicating the absence or presence of a descriptor in a description, we can choose to represent each descriptor with an embedding.
A word embedding is a way of representing a word as a vector.
By going through the process outlined in the image on the right, we can produce a single embedding for every wine description.
You will note that only single words have been extracted from the raw text instead of phrases — for more details on this, you can visit the code in the Jupyter Notebook referenced above.
Unfortunately, this approach did not yield better results than our ‘simple’ approach with 0’s and 1’s.
However, we can also use the method of creating average word embeddings to construct an average vector for all the descriptions of a given grape variety.
By then compressing this vector into two dimensions using a technique called t-SNE, we can build a visualization that tells us how similar the descriptions for our various grape varieties are.
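A minimal sketch of this averaging-plus-compression step, assuming numpy and scikit-learn; the word vectors here are random stand-ins, where a real run would use trained embeddings and all 20 varieties:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in 50-dimensional word vectors; in practice these come from a
# trained word-embedding model
word_vecs = {w: rng.normal(size=50) for w in ["cherry", "earth", "apple", "toast"]}

def average_embedding(descriptors):
    """Mean of the word vectors across a variety's descriptors."""
    return np.mean([word_vecs[w] for w in descriptors], axis=0)

variety_vecs = np.array([
    average_embedding(["cherry", "earth"]),
    average_embedding(["apple", "toast"]),
    average_embedding(["cherry", "apple"]),
    average_embedding(["earth", "toast"]),
])

# Compress 50 dimensions down to 2 for plotting; perplexity must be
# smaller than the number of points
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(variety_vecs)
```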
Although interpreting the space in t-SNE visualizations can be tricky, there are some lessons to be learned here.
We can see that white wines occupy one area in our plot, while red wines are all located towards the bottom.
Lighter reds such as Grenache, Pinot Noir and Gamay are located in the same part of the plot, with heavier, more tannic reds such as Syrah, Malbec and Tempranillo also in close vicinity to one another.
Cabernet Franc, one of the wines that was often misclassified as Pinot Noir or Cabernet Sauvignon, indeed lies between these two grape varieties.
For the white wines, we can see that sweeter grape varieties such as Gewürztraminer and Riesling lie relatively close to one another, while drier varieties such as Pinot Gris and Sauvignon Blanc occupy a different area in the plot.
It is somewhat surprising that Viognier and Chardonnay are so far apart.
Both of these wines are known as full-bodied whites and we know that Viogniers have consistently been misclassified as Chardonnays.
This may be due to the fact that we cannot wholly capture the similarity between these grape varieties in a simple, 2-dimensional plot.
Conclusion

Overall, there is still much work to do before we can build a RoboSomm that can reliably identify wines in a blind tasting.
Nevertheless, we have learned a lot in this exploratory exercise.
We have looked at ways to convert text-based wine descriptions into a set of descriptors, and have studied the differences between 20 popular grape varieties.
We have built a few preliminary models to predict grape variety.
With there still being enough room for improvement, we may continue by investigating questions such as:

- Can we use more sophisticated types of models to predict grape variety?
- Can we build flavor profiles for different grape varieties that are more comprehensive than the word cloud visualizations we have explored in this article?
- Can we go beyond just grape variety and build a model to predict the styles of wines by grape variety (e.g. Oaked Chardonnay vs. Unoaked Chardonnay)?
- In addition to predicting the style of wine, can we also predict the region and vintage associated with a wine?

As a parting thought, let us borrow some words of wisdom from the great sommelier Fred Dame.
In life, and in data science…