Performance Metrics in Machine Learning

Madeline Schiappa, Feb 26

In this article I provide a brief overview of several metrics used to evaluate the performance of models that simulate some behavior.

These metrics compare the simulated output to some ground truth.

Distribution Comparisons

Jensen-Shannon Divergence

Jensen-Shannon Divergence (JSD) measures the similarity between two distributions (i.e. the ground truth and the simulation).

Another way to describe this metric is as the amount of divergence between two distributions.

The JSD is a symmetrized and smoothed version of the Kullback-Leibler Divergence, or D(p,q), which describes the divergence between probability distributions p and q.

One important thing to note is that D is not symmetrical, in that D(p,q) does not equal D(q,p).

Imagine we have a sample x and we want to measure how likely x is to occur in the ground truth distribution p as opposed to the simulation distribution q.

The likelihood ratio (LR) measures this:

$$LR = \frac{p(x)}{q(x)}$$

LR > 1 indicates that p(x) is more likely, while LR < 1 indicates that q(x) is more likely.

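Now, to get the overall ratio for a dataset of N samples x_1, …, x_N, we take the product over the samples:

$$LR = \prod_{i=1}^{N} \frac{p(x_i)}{q(x_i)}$$

We take the log of the ratio to simplify the calculation:

$$\log(LR) = \sum_{i=1}^{N} \log \frac{p(x_i)}{q(x_i)}$$

Here log(LR) values > 0 indicate that p(x) fits the data better, while values < 0 indicate that q(x) fits the data better.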

Using this value, we can quantify how much better one model is than the other by asking: if you are sampling from p(x), how strongly does each sample indicate, on average, that p(x) describes the data better than q(x)?

This is also called its predictive power.
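The averaging argument above can be checked numerically. The sketch below is a minimal illustration, not the author's code: the two discrete distributions `p` and `q` and the helper names `kl` and `jsd` are assumptions made for this example. It draws samples from `p`, averages the log-likelihood ratio, and compares the result with the expected value D(p,q). It also computes the JSD in its usual form, as the average divergence of each distribution from their mixture m = (p + q)/2.

```python
import math
import random

# Hypothetical discrete distributions over three outcomes (illustrative values only).
p = [0.7, 0.2, 0.1]   # stands in for the ground truth
q = [0.4, 0.4, 0.2]   # stands in for the simulation


def kl(a, b):
    """Expected log-likelihood ratio log(a(x)/b(x)) for x drawn from a."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


def jsd(a, b):
    """Average of the divergences of a and b from their mixture m = (a + b) / 2."""
    m = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)


# Average log-likelihood ratio over N samples drawn from p.
rng = random.Random(0)
samples = rng.choices(range(len(p)), weights=p, k=100_000)
avg_log_lr = sum(math.log(p[x] / q[x]) for x in samples) / len(samples)

print("average log LR:", avg_log_lr, "D(p,q):", kl(p, q))
print("D(p,q):", kl(p, q), "D(q,p):", kl(q, p))          # not equal in general
print("JSD(p,q):", jsd(p, q), "JSD(q,p):", jsd(q, p))    # equal: JSD is symmetric
```

With enough samples the average settles near D(p,q), and swapping the arguments changes D but not the JSD.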

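If you average over the N samples and let N approach infinity, we get the expected value:

$$D(p,q) = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \log \frac{p(x_i)}{q(x_i)} = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$$

JSD symmetrizes and smooths this by comparing each distribution to their average m = (p + q)/2:

$$JSD(p,q) = \frac{1}{2} D(p,m) + \frac{1}{2} D(q,m)$$

Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov Test (KS Test) is a non-parametric test of the equality of two continuous, one-dimensional probability distributions, with a test statistic that quantifies the distance between the two distributions.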

If the KS statistic is high or the p-value is low, there is evidence against the null hypothesis that the two distributions are the same.

The first step is to sort the measured values in the sample, x_1 ≤ … ≤ x_n, and then compute the cumulative probability S(x), the fraction of all measurements whose values are less than or equal to x.

In this case, S(x) = 0 for any x below x_1, and S(x_n) = 1.

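The Kolmogorov-Smirnov statistic for a given cumulative distribution function F(x) is the maximum absolute difference between the two cumulative probabilities:

$$D_n = \sup_x |S(x) - F(x)|$$

If the sample comes from F(x), then D_n converges to 0 as n goes to infinity. When comparing two samples, F(x) is replaced by the cumulative probability of the second sample.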

Once the statistic is calculated, you refer to the appropriate Kolmogorov-Smirnov table and, based on your sample size, find the critical value. If D is greater than the critical value, the null hypothesis is rejected.
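The steps above can be sketched for the two-sample case, where the ground truth and the simulation each supply a set of measurements. This is a minimal illustration under stated assumptions: the function names `ecdf` and `ks_statistic` and the sample values are hypothetical, and the sketch stops at the statistic itself rather than looking up a critical value or p-value.

```python
import bisect


def ecdf(sample):
    """Return S(x): the fraction of measurements less than or equal to x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect.bisect_right(ordered, x) / n


def ks_statistic(sample_a, sample_b):
    """Maximum absolute gap between the two empirical cumulative probabilities."""
    s_a, s_b = ecdf(sample_a), ecdf(sample_b)
    # Both step functions only change at observed values, so checking those is enough.
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(s_a(x) - s_b(x)) for x in points)


# Hypothetical measurements, invented for illustration.
ground_truth = [0.1, 0.4, 0.7, 1.2, 1.9]
simulation = [0.2, 0.5, 0.9, 2.5, 3.1]

print(ks_statistic(ground_truth, ground_truth))  # identical samples give no gap
print(ks_statistic(ground_truth, simulation))
```

The returned value is the D that would then be compared against the tabulated critical value for the given sample sizes.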

Non-Parametric versus Parametric Tests

Non-parametric tests do not assume that the dataset follows a specific distribution, while parametric tests require certain assumptions.

However, non-parametric tests do require that different groups in the dataset have the same variability/dispersion.

Parametric tests perform well in the following cases:

- When distributions are skewed and non-normal (given a sufficiently large sample).
- When the spread of each group is different.
- When you want greater statistical power.

Non-parametric tests perform well when you want to assess the median rather than the mean.

Parametric tests can detect changes in mean in skewed distributions because of a change in the tail.

Non-parametric analysis focuses on the median which is relatively unaffected by changes in the tail.
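The different sensitivity of the mean and the median to the tail can be seen in a small example. This is only a sketch: the two datasets are invented for illustration, and it compares the summary statistics themselves rather than running any particular test.

```python
import statistics

# Hypothetical right-skewed measurements, invented for illustration.
baseline = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
# The same data, with the single tail value stretched further out.
heavier_tail = baseline[:-1] + [200]

for name, data in [("baseline", baseline), ("heavier tail", heavier_tail)]:
    print(name, "mean:", statistics.mean(data), "median:", statistics.median(data))
```

Only the mean moves when the tail value changes, which is why a test built on the mean reacts to the tail while one built on the median does not.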

One-to-One Comparisons

Root Mean Squared Error

Root mean squared error (RMSE) is the measurement of the difference between a predicted value from a simulation and a ground truth value.

Its main use is in regression, for scalar values.

The first step in calculating the RMSE is to compute the residuals by subtracting each predicted value from the actual value.

The next step is to average the squares of those residuals and then take the square root of that average.

Squaring removes negative values so that positive and negative residuals do not cancel out, and the square root returns the result to the original units.
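The calculation above can be sketched in a few lines. The function name `rmse` and the example values are assumptions made for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between paired actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")
    residuals = [a - p for a, p in zip(actual, predicted)]       # step 1: residuals
    mean_square = sum(r ** 2 for r in residuals) / len(residuals)  # step 2: average the squares
    return math.sqrt(mean_square)                                 # step 3: square root


# Hypothetical ground truth and simulated values.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

print(rmse(actual, predicted))
```

Here the residuals are 1, 0, -2 and -1, so their plain average would partly cancel, while the RMSE does not.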

Coefficient of Determination (R-Squared)

The coefficient of determination (R-Squared) quantifies how good the simulation is compared to a baseline model with no independent variables that always predicts the expected value of y.

It is mainly used in regression, for one-to-one comparisons between the predicted value and the ground truth value.

R-squared typically falls between 0 and 1, where values closer to 1 indicate a greater proportion of variance accounted for by the model.

It is possible to get a negative R-squared for equations that do not contain a constant term, indicating that the fit is actually worse than just fitting a horizontal line.

If the value is below 0, it cannot be interpreted as the square of a correlation and is a good indication that the constant term should be added to the model.

For example, if R-squared = 0.93, then 93% of the variation in the ground truth data is explained by our simulation.
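A minimal sketch of this comparison against the baseline follows, assuming the common definition R-squared = 1 - SS_res / SS_tot, where SS_res is the squared error of the simulation and SS_tot is the squared error of the baseline that always predicts the mean. The function name `r_squared` and the data are hypothetical.

```python
def r_squared(actual, predicted):
    """1 - (squared error of the predictions) / (squared error of the mean baseline)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical ground truth and three sets of predictions.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
close_fit = [1.1, 1.9, 3.2, 3.9, 5.1]
mean_baseline = [3.0] * 5               # always predicts the mean of actual
reversed_fit = [5.0, 4.0, 3.0, 2.0, 1.0]

print(r_squared(actual, close_fit))      # close to 1
print(r_squared(actual, mean_baseline))  # matches the baseline
print(r_squared(actual, reversed_fit))   # negative: worse than a horizontal line
```

The third case shows how predictions that do worse than the horizontal line push the value below 0.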

Sequence Comparisons

Dynamic Time Warping

Dynamic Time Warping (DTW) [2] is a measure of how similar two temporal sequences are in a time series analysis.

DTW looks for the optimal alignment between the two series, as opposed to looking at the Euclidean distance between the two points at each time step.

We first calculate a distance matrix between time series 1 (A) and time series 2 (B).

The matrix has the points plotted for one time series on the vertical axis and the other on the horizontal.

We then compute the distance based on these values and a chosen distance metric D (usually Euclidean).

Figure: Distance matrix (from Regina J. Meszlenyi, Petra Hermann, Krisztian Buza, Viktor Gal, and Zoltan Vidnyanszky, 2017) [1]

After the distance matrix is calculated, you construct the warp path W = (w_1, w_2, …, w_n) by backtracking and a greedy search to minimize the distance.

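Using the warp path, you get the distance value by summing the distances within the path:

$$Dist(W) = \sum_{k=1}^{n} D(w_k)$$

where D(w_k) is the distance between the pair of points from A and B matched by w_k. The smaller the distance, the more similar the temporal sequences.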
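The procedure can be sketched with the standard dynamic-programming formulation of DTW, which is one common way to realize the steps above and not necessarily the exact method of [1] or [2]. The function name `dtw`, the use of absolute difference as the point distance, and the example series are assumptions for illustration.

```python
def dtw(a, b):
    """Return (distance, warp_path) for two 1-D sequences using |a_i - b_j| as D."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: smallest accumulated distance aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, greedily stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Hypothetical series: B has the same shape as A but starts one step late.
series_a = [0, 1, 2, 3, 2, 1, 0]
series_b = [0, 0, 1, 2, 3, 2, 1, 0]

distance, path = dtw(series_a, series_b)
print(distance)
print(path)
```

Because the alignment can match one point in A to two points in B, the delayed copy is treated as the same shape even though the series have different lengths.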

Rank-Biased Overlap

Rank-biased overlap (RBO) [3] measures the similarity between infinite ranked lists, which may or may not contain the same items, by calculating the overlap at various depths and bounding the average overlap across those depths with a geometric series, a type of convergent series.

RBO falls in the range 0 to 1 where 0 means maximally disjoint and 1 means identical.

It can be shown that the infinite sum of a geometric series with ratio 0 < p < 1 is finite and given by:

$$\sum_{d=1}^{\infty} p^{d-1} = \frac{1}{1-p}$$

A geometric series is a set of terms in which there is a constant ratio between successive terms.

An example is:

$$1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \cdots = 2$$

The use of a geometric series allows us to explicitly model a user's behavior because the values in a geometric series decrease with increasing depth, allowing you to model the likelihood of going from a given rank position i to i+1.
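A minimal sketch of the depth-weighted overlap follows, assuming the commonly cited form RBO = (1 - p) * sum over depths d of p^(d-1) * A_d, where A_d is the fraction of items shared by the two top-d prefixes. The function name `rbo_truncated`, the default p = 0.9, and the example lists are assumptions for illustration. Since real lists are finite, the sum is cut off at a fixed depth, so the result is a lower bound on the full score and not the extrapolated value.

```python
def rbo_truncated(ranking_s, ranking_t, p=0.9, depth=None):
    """Partial RBO sum up to `depth`; assumes no duplicate items within a ranking."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    max_depth = min(len(ranking_s), len(ranking_t))
    if depth is None:
        depth = max_depth
    if depth > max_depth:
        raise ValueError("depth exceeds the length of the shorter ranking")
    seen_s, seen_t = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_s.add(ranking_s[d - 1])
        seen_t.add(ranking_t[d - 1])
        agreement = len(seen_s & seen_t) / d     # A_d: proportional overlap at depth d
        score += p ** (d - 1) * agreement        # geometric weight shrinks with depth
    return (1 - p) * score


# Hypothetical rankings.
list_a = ["a", "b", "c", "d", "e"]
list_b = ["b", "a", "c", "e", "f"]
list_c = ["v", "w", "x", "y", "z"]

print(rbo_truncated(list_a, list_a))   # identical prefixes: 1 - p**5 at depth 5
print(rbo_truncated(list_a, list_b))   # partial agreement
print(rbo_truncated(list_a, list_c))   # disjoint lists
```

Identical lists score 1 - p^k at depth k and not exactly 1, because the weights beyond depth k are never added; the missing mass p^k is the tail of the geometric series.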

References

[1] Regina J. Meszlenyi, Petra Hermann, Krisztian Buza, Viktor Gal, and Zoltan Vidnyanszky. Resting state fMRI functional connectivity analysis using dynamic time warping. Frontiers in Neuroscience, 11(FEB):1-17, 2017.

[2] Stan Salvador and Philip Chan. FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space. Intelligent Data Analysis, 11:561-580, 2007.

[3] William Webber, Alistair Moffat, and Justin Zobel. A similarity measure for indefinite rankings. ACM Transactions on Information Systems, 28(4):1-38, 2010.