However, the accuracy was significantly higher at 95%.
Below is the ROC curve.
OK, but if you can get 73% accuracy just from whether a wine has lime flavor or not, that’s not really fair.
Let’s go back to just basic characteristics (no tastes) — with Random Over Sampling.
Attempt 3: Random Over Sampling with basic characteristics
Following the same method as described above, where I add features until the accuracy does not increase, we get a resulting logistic regression formula of [Y ~ body + alcohol + fruity] with an accuracy of 91%.
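The two steps in this attempt, random over sampling and adding features until accuracy stops improving, can be sketched in plain Python. This is a minimal illustration under assumptions, not the code behind the results above: `random_over_sample`, `forward_select`, and the `score` callback are hypothetical names, and the scoring step (fit a logistic regression on the chosen features and return its accuracy) is left to the caller.

```python
import random
from collections import defaultdict


def random_over_sample(rows, label_of, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority-class size.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


def forward_select(candidates, score):
    """Greedily add the feature that most improves score; stop when nothing helps.

    score is a caller-supplied (hypothetical) function that takes a list of
    feature names and returns an accuracy for a model fit on those features.
    """
    chosen, best = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        scored = [(score(chosen + [f]), f) for f in remaining]
        top_score, top_feature = max(scored)
        if top_score <= best:
            break  # no remaining feature increases accuracy
        chosen.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return chosen, best
```

Over sampling is best applied to the training split only, so that duplicated rows do not leak into the data used to measure accuracy.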
The ROC curve looks like this:
So slightly worse, but really not that bad if we are able to predict this accurately with just three basic characteristics of a wine.
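For reference, a ROC curve like the ones described here can be computed from predicted probabilities by sweeping the classification threshold from high to low. The sketch below is illustrative only: `roc_points` and `auc` are hypothetical helper names, and the label encoding (1 for red, 0 for white) is an assumption.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per distinct score."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    # Highest scores first, so each step lowers the threshold.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

An area of 1.0 means the model ranks every red wine above every white wine, while 0.5 is what random guessing would give.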
Interpreting the Findings
As discussed above, the objective here is to understand, given a model that classifies wines as red or white with decently high accuracy, which wines are more difficult to classify, or easier to misclassify.
Below is the data for each type of wine: whether it is actually a red wine, and the predicted probabilities of being white or red from our final model.
Our final output is a nice visualization that demonstrates which wines are most likely to be misclassified as red or white, based on this model.
The wines that were either misclassified or close to being misclassified are named below.
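One way to produce such a list is to rank wines by how close the predicted probability of red sits to the decision threshold. The sketch below assumes a 0.5 cutoff and a hypothetical record layout of (name, is_red, p_red). Neither comes from the original analysis, and the default `margin` of 0.1 is an arbitrary illustrative choice.

```python
def flag_hard_to_classify(records, threshold=0.5, margin=0.1):
    """Split wines into misclassified and near-miss lists.

    records: iterable of (name, is_red, p_red) tuples, where p_red is the
    model's predicted probability that the wine is red (assumed layout).
    """
    misclassified, near_miss = [], []
    for name, is_red, p_red in records:
        predicted_red = p_red >= threshold
        distance = abs(p_red - threshold)
        if predicted_red != is_red:
            misclassified.append((name, distance))
        elif distance <= margin:
            near_miss.append((name, distance))
    # Smallest distance first: these sit closest to the decision boundary.
    misclassified.sort(key=lambda item: item[1])
    near_miss.sort(key=lambda item: item[1])
    return [n for n, _ in misclassified], [n for n, _ in near_miss]
```

The first list holds wines the model got wrong. The second holds wines it got right but only narrowly, which are the ones most similar to the other color on the model's features.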
Why does this matter?
Generally, body, alcohol, and fruitiness are similar within red wines and within white wines.
We can classify red or white based on those three attributes with 91% accuracy.
The red wines that were closest to being predicted as white, or were misclassified, are those most similar to white wines on these attributes, and so are wines that white wine lovers may like!
Thanks for reading!
Code and data are here on data.