Then you may consider additional metrics such as Precision, Recall and the F score (a combined metric). Before diving in, let's take a step back and understand the terms that form the basis for these.

Some Basic Terms

True Positive: a label that was predicted Positive (in our scenario, an Authentic bank note) and is actually Positive (i.e. it belongs to the Positive 'Authentic' class).

True Negative: a label that was predicted Negative (in our scenario, a Forged bank note) and is actually Negative (i.e. it belongs to the Negative 'Forged' class).

False Positive: a label that was predicted Positive but is actually Negative. In simple words, a note wrongly predicted as Authentic by our model that is actually Forged. In hypothesis testing this is also known as a Type I error, the incorrect rejection of a true null hypothesis.

False Negative: a label that was predicted Negative but is actually Positive (an Authentic note predicted as Forged). It is also known as a Type II error, the failure to reject a false null hypothesis.

Now let's look at the most common evaluation metrics every Machine Learning practitioner should know!

[Figure: Mathematical Definitions (Formulas)]

Metrics beyond Accuracy

Precision

It is the 'exactness' of the model: its ability to return only relevant instances, measured as the fraction of predicted positives that are actually positive. If your use case involves minimizing False Positives, i.e. in the current scenario you don't want Forged notes to be labelled as Authentic by the model, then Precision is the metric you need.

    # Precision: share of predicted positives that are truly positive.
    # tp, fp, fn and tn are the confusion-matrix counts, assumed to be defined already.
    Precision = tp / (tp + fp)
    print("Precision {:0.2f}".format(Precision))

[Figure: Precision is about Repeatability & Consistency]

Recall

It is the 'completeness' of the model: its ability to identify all relevant instances. It is also called the True Positive Rate or Sensitivity. In the current scenario, if your focus is to have the fewest False Negatives, i.e. you don't want Authentic notes to be wrongly classified as Forged, then Recall can come to your rescue.

    # Recall: share of actual positives that the model found
    Recall = tp / (tp + fn)
    print("Recall {:0.2f}".format(Recall))

F1 Measure

The harmonic mean of Precision and Recall, used to indicate a balance between the two by giving each equal weight. It ranges from 0 to 1: the F1 Score reaches its best value at 1 (perfect precision and recall) and its worst at 0.

    # F1 Score: harmonic mean of Precision and Recall
    f1 = (2 * Precision * Recall) / (Precision + Recall)
    print("F1 Score {:0.2f}".format(f1))

F-beta Measure

It is the general form of the F measure. Beta values of 0.5 and 2 are the ones usually used: 0.5 leans towards Precision, whereas 2 favors Recall, treating it as twice as important as Precision. With beta = 1 it reduces to the F1 Score.

    # F-beta score calculation
    def fbeta(precision, recall, beta):
        return ((1 + pow(beta, 2)) * precision * recall) / (pow(beta, 2) * precision + recall)

    f2 = fbeta(Precision, Recall, 2)
    f0_5 = fbeta(Precision, Recall, 0.5)
    print("F2 {:0.2f}".format(f2))
    print("F0.5 {:0.2f}".format(f0_5))

Specificity

It is also referred to as the 'True Negative Rate': the proportion of actual negatives that are correctly identified. The fewer Forged notes the model mislabels as Authentic, the higher its Specificity.

    # Specificity: share of actual negatives that are correctly identified
    Specificity = tn / (tn + fp)
    print("Specificity {:0.2f}".format(Specificity))

ROC (Receiver Operating Characteristic curve)

The plot of the 'True Positive Rate' (Sensitivity/Recall) against the 'False Positive Rate' (1 - Specificity) at different classification thresholds.

The area under the ROC curve (AUC) measures the entire two-dimensional area underneath the curve. It is a measure of how well a parameter can distinguish between two diagnostic groups, and it is often used as a measure of the quality of classification models. A random classifier has an AUC of 0.5, while the AUC for a perfect classifier is equal to 1.

    # ROC
    import matplotlib.pyplot as plt
    import scikitplot as skplt  # to make things easy

    # LR is assumed to be the classifier fitted earlier; predict_proba returns one column per class
    y_pred_proba = LR.predict_proba(X_test)
    # plot_roc replaces the deprecated plot_roc_curve
    skplt.metrics.plot_roc(y_test, y_pred_proba)
    plt.show()

Conclusion

Since the problem selected to illustrate the use of the Confusion Matrix and related metrics was simple, every value came out high (98% or above), be it Precision, Recall or Accuracy. Usually that will not be the case, and you will need domain knowledge about the data to choose between one metric and another (often a combination of metrics).

For example, if the task is finding the spam in your mailbox, high Precision will matter most (as you don't want ham to be labelled as spam): it tells us what proportion of the messages we classified as spam actually were spam.
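To make the spam example concrete, here is a minimal sketch that computes all of the metrics above from the four confusion-matrix counts. The function name `classification_metrics` and the counts themselves are hypothetical, chosen only for illustration (1000 messages, 50 of them spam); they are not results from the bank note model.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute common metrics from confusion-matrix counts."""

    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is zero
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)      # exactness
    recall = safe_div(tp, tp + fn)         # completeness / sensitivity
    specificity = safe_div(tn, tn + fp)    # true negative rate
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    b2 = beta ** 2
    f_beta = safe_div((1 + b2) * precision * recall, b2 * precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_beta": f_beta,
    }


# Hypothetical spam filter: 50 spam messages among 1000 (illustrative numbers only)
metrics = classification_metrics(tp=30, fp=5, fn=20, tn=945)
for name, value in metrics.items():
    print("{} {:0.3f}".format(name, value))
```

With these made-up counts, Accuracy is 97.5% even though Recall is only 60%, because the many correctly classified ham messages dominate the total. This is the kind of situation where Accuracy alone hides what matters and Precision, Recall or an F-beta score tells you more.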