Is it possible to peek inside this Magic Black Box that is Machine Learning, and understand how and why a Machine Learning system arrived at a particular result?

How to Good Science

Let’s take a step back and contemplate the word “science” that makes up one half of “Data Science”.
It is hard to properly define what constitutes “Good” science, but, if we look to the STEM research fields, there are some established guidelines and principles to follow in order to avoid doing outright bad science.
When designing an experiment to test a hypothesis, it is equally important to put considerable thought into how to ensure an optimal execution of the experiment, by following these steps:

1. Isolate the experimental setup to avoid contamination from external influences.
2. Choose proper metrics.
3. Accurately measure and store the sampled data.
4. Analyse and interpret the data.
5. Verify the results.

Especially important is The Zeroth Law: Remember to leave predetermined viewpoints at the door, and emotionally prepare for having the hypothesis completely disqualified…

These guidelines are applicable to Machine Learning as well, if one thinks of the training of the model as the experiment, and of the model's output as the data sampled and collected during the execution of the experiment.
Choosing suitable metrics, storing the output, and even verifying the results, are easily transferable concepts.
The challenge lies with controlling for outside influences and interpreting the output.
This is because the training data that is put into the Machine Learning system contains both the desired signal that tests the hypothesis and external noise that contaminates and confounds the result.
And, when interpreting the result, we need to be able to decouple the signal from the noise.
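
The contamination problem described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from any real system: the toy data generator `make_data`, the column names `signal` and `leak`, and the choice of plain least squares as the "model". The idea is that the target truly depends only on `signal`, while `leak` is an external influence that happens to track the target in the training data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def make_data(n, contaminated):
    """Toy data (hypothetical): the target depends only on `signal`."""
    signal = rng.normal(size=n)
    y = 2.0 * signal + rng.normal(scale=0.5, size=n)
    if contaminated:
        # External influence: a column that tracks the target itself,
        # e.g. something recorded after the outcome was already known.
        leak = y + rng.normal(scale=0.1, size=n)
    else:
        # Same column, but now unrelated to the target.
        leak = rng.normal(size=n)
    return np.column_stack([signal, leak]), y

def r_squared(X, y, coef):
    """Coefficient of determination of the linear prediction X @ coef."""
    residual = y - X @ coef
    return 1.0 - np.sum(residual**2) / np.sum((y - y.mean())**2)

X_train, y_train = make_data(2000, contaminated=True)
X_clean, y_clean = make_data(2000, contaminated=False)

# Ordinary least squares without an intercept (all columns are zero-mean).
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_r2 = r_squared(X_train, y_train, coef)
clean_r2 = r_squared(X_clean, y_clean, coef)

print("weights [signal, leak]:", coef)
print("R^2 on contaminated data:", train_r2)
print("R^2 on clean data:", clean_r2)
```

In this toy setup the fitted weights should lean on `leak` rather than on `signal`, and the fit that looks excellent on the contaminated data should fall apart on clean data. A metric alone would not reveal this; looking at what the model actually relies on does.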

Good Data Science requires Interpretability

Academic and corporate interests have been more focused on advancing the theories, algorithms, and software tools needed to perform increasingly elaborate experiments than on experimental verification, which is the norm in the more established physical sciences.
Fortunately, the need for accountability is catching up, germinating into the sub-field of Interpretable Machine Learning.
Interpretability in Machine Learning is not a well-defined concept, because it is context-dependent and will mean different things for different problems.
But, arguably, one can take it to mean opening up the magic black box, rendering it transparent, and being able to interpret, explain and trust what is going on inside it.
Speaking in general terms, to assess if one has achieved Interpretability, I propose that we should be able to answer the following questions with some degree of confidence (that is, given our hypothesis, our questions posed to the data):

In part two of this series I will elaborate on this checklist, supply examples to argue the relevance of these questions, and outline the potential consequences of failing to answer them.

Understanding data is easier than understanding science

It might be tempting to buy into the pitch that “Your own engineers and developer teams understand your data the best, so just re-skill them”.
Sure, ideally every company that has some data-driven operational aspect, or provides data-driven services, should have an in-house team of data scientists and Machine Learning experts.
Also, due to the scarcity and high demand of this competency, it makes sense to retrain in-house developers.
What is often overlooked is that most data scientists are not primarily programmers: they often hold a Master’s Degree or PhD in a STEM field, such as Mathematics, Statistics or Physics, have years of experience in doing “Good Science”, and have acquired a generally sceptical attitude towards both data quality and model veracity.
However, if one does decide on going down the retraining road, it is highly recommended that the team adheres to the Checklist of Interpretability and becomes equipped with the following:

- The knowledge required to identify issues and shortcomings of the data.
- The skills to properly pose the questions that one wants the data to answer.
- The abilities to interpret and explain the answers the system outputs.

And lastly, everyone involved, including the product owner, should gain a deep appreciation for The Precautionary Principle: avoid using Machine Learning technology that is not fully understood in decision-making systems where the stakes are high or critical.

Getting Machine Learning education and training right will play a significant role in the improvement of our collective future.
Achieving the common goals of halting climate change, fighting poverty and raising living standards in developing countries requires us to make technology that is resource and energy efficient.
It must also be interpretable, explainable and proven worthy of our trust, so it does not cause unintended harm along the way.
After all, who would accept reduced access to electricity in order to lessen their ecological impact? Would you?