Big Data’s black box problem

Gordon Webster · May 20

If you’ve ever read one of those murder mystery novels in which the murderer is finally unmasked by the tireless, dogged detective at the end of the story, you know that the reader’s gratification doesn’t come solely from learning “whodunnit”, but also from knowing why and how they did it.
There’s an analogous situation with many of the big data analytical tools and algorithms that are often marketed to biomedical companies as “solutions” to their research and development challenges.
Big data approaches generally promise answers in return for data (usually lots and lots of it), but much like the planet-earth-as-giant-computer scenario in “The Hitchhiker’s Guide To The Galaxy”, the answers can be just as baffling as the questions if they cannot be adequately explained.
And just to be clear, we’re not even talking about the “correlation is not causation” issue here.
Yes, an ill-considered big data solution might tell me that I shouldn’t invest in companies that manufacture margarine while the divorce rate in Maine is declining, or that I should stay away from swimming pools when the actor Nicolas Cage is about to release a new movie. But the issue of data misuse is not the itch that we’re trying to scratch here.
In the context of scientific research at least, answers need to explain things.
They really do.
Imagine you’re a scientist at a commercial life science company, presenting the latest results from your big data “solution” provider to the people who will be deciding whether or not to fund your R&D program for the next financial year.
After 30 or 40 dense slides of (and you can choose your poison here) regression analysis, k-means clustering, neural network activation functions, non-probabilistic binary linear classification, and so on, you finally announce to the room that the algorithm has determined that x is the critical therapeutic target.
Cue the sound of crickets chirping.
While this “solution” may superficially have more meat on it than the “42” delivered by the planetary computer in “Hitchhiker’s Guide”, it is barely any more useful.
Imagine for a moment that you are one of the people in the room tasked with deciding whether or not to green-light the investment of, say, a further seven-figure dollar sum in this R&D program.
Based upon this presentation, wouldn’t you have a few questions?

- What if you ran the algorithm with a different set of weights, thresholds, or input parameters? Would you get the same answer?
- What is the probability that this answer is “correct”?
- How much more likely is the top answer to be correct than the second (or subsequent) answers?
- How many of these answers have probabilities that make them worth testing in the laboratory?

And in all of this discussion about a life science R&D program, the one thing that seems glaringly absent is any actual life science.
The biology being described by the data is barely perceptible amidst this welter of numerical trends, parameters and probabilities — and herein lies a big part of the big data problem.
What has been largely lost in this process is any causal explanation of what is going on.
Again, the problem is not that correlation is being confounded with causation — it’s that absent any causal explanation, it’s very hard to know what to make of any correlation.
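To make the instability worry concrete, here is a toy sketch in pure NumPy, using entirely hypothetical data: the same k-means procedure, run on the same data with two different random initializations, can return two different partitions — two different “answers” — with nothing in the output to tell you which one to believe.

```python
import numpy as np

def kmeans(X, k, seed, n_iter=100):
    """Bare-bones k-means (Lloyd's algorithm), kept minimal so the example is self-contained."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

# A hypothetical feature matrix with no clean 3-cluster structure
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

labels_a = kmeans(X, k=3, seed=1)
labels_b = kmeans(X, k=3, seed=2)

# Permutation-invariant comparison: do the two runs group the
# same pairs of points together?
same_a = labels_a[:, None] == labels_a[None, :]
same_b = labels_b[:, None] == labels_b[None, :]
disagreement = (same_a != same_b).mean()
print(f"pairwise disagreement between the two runs: {disagreement:.3f}")
```

On cleanly separated clusters the two runs would typically agree; on ambiguous data like this, they frequently do not — which is precisely the “would I get the same answer?” question that the presentation leaves hanging.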
Funding this R&D program on the strength of a set of numerical trends or correlations is something of a leap of faith, rather like agreeing to the 56 pages of the Apple iTunes terms and conditions without having really read them (because who has the time?), except that the potential consequences of this particular leap of faith are probably much more far-reaching.
And in the back of your mind, there’s always that nagging doubt about making the decision to fund the program without really having a clear rationale for it.
If the program falters or fails, how do you explain that you made your decision to proceed without really understanding why?

In the best-case scenarios, big data approaches often deliver so much “solution” that it’s hard to see the wood for the trees when it comes to evaluating it with a view to making decisions based upon it.
In the worst case scenarios, the big data approach is a black box that accepts data and outputs an “answer” that seems to have been pulled out of thin air.
The real kicker for the research scientist is that they are working in a causal idiom in which knowledge and insight are built upon an intellectual framework of cause and effect.
In their world, data (or trends in data) are not knowledge (with a big K).
Real Knowledge is based upon causal explanations of processes and events that can serve as the substrate for new hypotheses and designs for experiments that can test them.
The scientist’s whole approach to evaluating new information is predicated upon this, which makes it very difficult to evaluate, in a scientific context, the kind of correlative, data-intensive output from the big data approaches that we have discussed here.
Looking at all of this from the scientist’s viewpoint: what most big data approaches seem to generate is just more data, not real knowledge.

Incidentally, all of the above is why we favor mechanistic, causal modeling in the work that we do for our own life science consulting clients.
Nowhere more than in the biomedical field is it important to get real, explanatory answers rather than just the mysterious, cryptic musings of a black box.
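By contrast, even the smallest mechanistic model wears its causal assumptions on its sleeve. As an illustrative sketch (the rate constants below are invented for the example, not taken from any real system), consider simple mass-action binding of a drug to its receptor: every parameter (kon, koff, total receptor) is a physical quantity you could perturb in the lab, and the model’s equilibrium prediction can be checked against the textbook occupancy formula.

```python
# Mass-action binding of ligand L to receptor R:  R + L <-> RL
#   d[RL]/dt = kon * L_free * (R_total - RL) - koff * RL
# All parameter values are illustrative, not measured.
kon = 1.0      # association rate constant (1 / (conc * time))
koff = 0.1     # dissociation rate constant (1 / time)
R_total = 1.0  # total receptor concentration
L = 0.1        # free ligand, held constant (ligand in large excess)

# Simple forward-Euler integration to steady state
RL, dt = 0.0, 0.01
for _ in range(20000):
    dRL = kon * L * (R_total - RL) - koff * RL
    RL += dt * dRL

# Causal, testable prediction: equilibrium occupancy = L / (L + Kd),
# where Kd = koff / kon
Kd = koff / kon
occupancy_predicted = L / (L + Kd)
print(f"simulated occupancy: {RL / R_total:.4f}")
print(f"analytic occupancy:  {occupancy_predicted:.4f}")
```

If an experiment contradicts the predicted occupancy, you know exactly which mechanistic assumption to revisit: the rate constants, the excess-ligand approximation, the single binding site. That is the kind of follow-up a black-box correlation cannot offer.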
At a very fundamental level, this problem relates directly to who we are — it’s about being curious and wanting to understand things — it’s about being human.
We don’t just want an answer, we need to understand why it’s the answer.
© Gordon Webster

Gordon Webster is a partner at the digital biology consulting firm Amber Biology, a Ronin Scholar, and a co-author of Python For The Life Sciences.