The Loss of Inference
Stephen Chen · Feb 1
[Header image: General Photographic Agency/Getty Images]

The burgeoning field of Data Science / Machine Learning borrows heavily from Statistics but bastardizes it.
For example, “dummy variable” becomes “one-hot encoding”, and “independent variables” become “features”.
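The renaming really is one-to-one. A minimal sketch using pandas (the `origin` column is a made-up example, not from any dataset in the article):

```python
import pandas as pd

# A categorical "dummy variable" in statistics parlance...
df = pd.DataFrame({"origin": ["usa", "europe", "japan", "usa"]})

# ...is produced by "one-hot encoding" in ML parlance: one indicator
# column per category level.
encoded = pd.get_dummies(df["origin"], prefix="origin")
print(list(encoded.columns))  # ['origin_europe', 'origin_japan', 'origin_usa']
```

Same construct, two names; the statistical name carries the warning (drop one level to avoid perfect collinearity with the intercept) that the ML name does not.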
This shift in nomenclature loses the methodological meaning inherent in the original names. For instance, a casual Google search on the “auto-mpg” dataset turns up many how-to pages, almost all of which treat every variable as a “feature” and throw everything (including non-independent variables) into the model.
This democratization of Data Science shifts the priority from explanation to production.
The how-to, can-do approach contrasts with the why-not-to dictums of Statistics classes (at least in my time). As a result, inference based on triangulated evidence and reasoning has increasingly been reduced to a merely mathematical or computational problem.
What is unique to Data Science / Machine Learning is a set of purely algorithmic approaches (e.g. Genetic Algorithms, Neural Networks, Boosting, and Stacking) that have been increasingly hyped for their superior “inference”, where they outperform other methods / humans / whatever strawman is on hand.
These algorithmic approaches are marketed as “learning from data”, but even that concept has been bastardized: the Bayesian approach of adjusting probabilities based on extant data has been reduced to an algorithm that essentially attempts to fit a curve to as many points as possible.
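The curve-fitting caricature is easy to make concrete. A toy NumPy sketch (the data are invented for illustration): give a polynomial as many free parameters as there are data points and it will pass through every point exactly, noise included.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 6)
y = x + 0.1 * rng.standard_normal(6)  # a noisy linear relationship

# Six coefficients for six points: the fit reproduces every observation
# exactly, noise and all -- curve-fitting, not inference.
coef = np.polyfit(x, y, deg=5)
max_residual = np.max(np.abs(y - np.polyval(coef, x)))
print(max_residual < 1e-8)  # True: the curve passes through all the points
```

Zero training error here tells us nothing about the underlying (linear) process that generated the data.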
In short, inference has gone from a sniper aiming to hit the target as close as possible, to firing a shotgun hoping one will hit close to the target via computational brute force.
Statistical inference is intimately tied to probability distributions: the Gaussian, Poisson, Binomial, etc. are evidence-backed probability distributions corresponding to specific event characteristics.
There are application domains where algorithmic approaches are wholly appropriate (e.g. genetic algorithms in robotics), and even necessary (neural networks in image classification), when it is difficult to operationalize a probability density (and the scope of data and context is contained).
The illustration above shows why Data Scientists applying the latest algorithmic approach in a domain with known probability densities risk sacrificing predictive power for model accuracy.
They are either seduced by the latest “fad” or unaware that algorithm-based approaches cannot predict beyond the limits of their input data range.
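As an illustration only (a pure-NumPy nearest-neighbour regressor is my own stand-in for the tree- and neighbour-style learners the point applies to): such models flat-line outside the training range, while even a naive linear model extrapolates the underlying trend.

```python
import numpy as np

# Training data: a purely linear process y = 2x observed on x in [0, 10].
x_train = np.arange(0.0, 11.0)
y_train = 2.0 * x_train

def knn_predict(x_new, k=1):
    """Nearest-neighbour regression: average y over the k closest x."""
    idx = np.argsort(np.abs(x_train - x_new))[:k]
    return y_train[idx].mean()

# A linear fit recovers the trend and extrapolates it...
slope, intercept = np.polyfit(x_train, y_train, deg=1)
print(round(slope * 20.0 + intercept, 6))  # 40.0

# ...while the memorizing learner flat-lines at the training boundary.
print(knn_predict(20.0))  # 20.0
```

The nearest-neighbour model is perfectly “accurate” inside [0, 10] and systematically wrong the moment the input drifts past it.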
More importantly, model accuracy should not be the sole end in itself, because accuracy and overfitting are two sides of the same coin.

All this means it is more important than ever to practice the three criteria of model evaluation that are typically tossed out at the beginning of Research / Statistics classes and given scant attention after.
Parsimony — the simplest possible model is the best model.
This recognizes that adding more variables eventually overfits, and that each method introduces its own biases and data assumptions.
Validity — addressing potential biases in the data, and triangulating results with external sources rather than accepting model-generated metrics as truth.
Reliability — the extent to which the results are replicable in different / real-world contexts.
The current reliance on accuracy metrics in Machine Learning parallels the reliance on p-values by the scientific community.
By itself, a highly accurate model or a highly statistically significant study does not guarantee it would perform similarly in a different / real-world context.
This problem of an ever-increasing accumulation of unreplicable published research forced the American Statistical Association (ASA) to issue a statement on p-values in 2016, essentially saying that they should not be used as the sole basis of evaluation and are not a substitute for scientific reasoning.
Data Scientists, ML experts, and the like should take the same to heart (particularly in domains dealing with stochastic processes, shifting time series, etc.), as it typically takes less than 10% input noise/variance to break predictions.
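A toy NumPy sketch of that last point (the 10-point sine curve and the single perturbed observation are my own choices, for illustration only): with an overly flexible fit, perturbing one input by 0.1 (10% of the signal's amplitude) moves a within-range prediction by several times that amount.

```python
import numpy as np

# Fit a degree-9 polynomial through 10 equispaced points of a smooth signal.
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x)
coef_clean = np.polyfit(x, y, deg=9)

# Perturb a single training point by 0.1 (10% of the amplitude) and refit.
y_noisy = y.copy()
y_noisy[4] += 0.1
coef_noisy = np.polyfit(x, y_noisy, deg=9)

# A prediction near the edge of the range (still *inside* it) shifts by
# several times the size of the perturbation.
x_eval = 0.95
shift = abs(np.polyval(coef_noisy, x_eval) - np.polyval(coef_clean, x_eval))
print(shift > 0.2)  # True: the 0.1 perturbation is amplified
```

The flexible model does not just absorb the noise; it amplifies it, which is exactly why in-sample accuracy says little about real-world reliability.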