Let’s face it: at present, data science at its best is about recognising pictures of cats.
Now, automatically identifying pictures, recognising spoken instructions or suggesting what to type next is obviously tremendously useful, as it can enhance user experience as well as automate tedious and labour-intensive tasks.
But outside of Siri and Google Translate, people in a business context are generally interested in knowing the effects of decisions they make, to see if they should make them at all.
A cat has features which define it to the exclusion of anything else: ears, tails, whiskers, you name it.
So that in the end, a spade is a spade, and, well, a cat is a cat.
But when it comes to humans, things get messier.
A customer’s purchasing habits, a fraudster’s embezzling techniques or an Uber driver’s ride patterns are the result of people’s responses to their environment, taking into account the costs and benefits of their actions (though maybe not to the fullest extent).
These costs and benefits can vary through time, as a response to, say, changes in the law or available technology.
They can also change as a result of learning by the participants.
An example will better illustrate why a dose of theory becomes tremendously important in such contexts: just imagine I’m trying to sell you a regular pencil for $1,000.
You wouldn't buy it, and you'd probably say: "No way, that's not the price of a pencil!"
And you’d be right because my offer is not a price, until you and I agree to the exchange.
This basic toy example can have dramatic consequences in the real world: in discussions I had with data scientists at a relatively large tech company, they were really puzzled to hear that prices posted by providers on their platform only became price information once a seller actually met a buyer. This meant that no less than 75% of the data had to be thrown away, as most postings didn't end up in a transaction.
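To make the filtering concrete, here is a toy sketch (the field names and figures are invented for illustration): only postings that ended in a sale count as price observations, and the rest is discarded.

```python
# Toy sketch: a posted price is only a price observation if a
# transaction actually took place. Field names are made up.
postings = [
    {"posted_price": 10.0, "sold": True},
    {"posted_price": 1000.0, "sold": False},  # my $1,000 pencil
    {"posted_price": 12.0, "sold": False},
    {"posted_price": 9.5, "sold": True},
]

# Keep only prices at which a buyer and seller actually agreed.
prices = [p["posted_price"] for p in postings if p["sold"]]
discarded = 1 - len(prices) / len(postings)
print(prices, discarded)  # [10.0, 9.5] 0.5
```

In this toy set, half the postings carry no price information at all; on the platform discussed above, that share was closer to three quarters.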
Data about human behaviour is suffused with these kinds of issues, of which price bargaining is just one.
What this means is that any model that one blindly fits to the data will only be as good as the data it’s being fitted to, until something new comes along which changes the nature of the interaction between agents.
So since the mother-of-all-data-sets doesn’t exist, what theory does is weave in as common knowledge what one has learned from other unconnected data sets.
This is the obvious role of theory as generalisation.
Cause and effect

Now, theory doesn't only prove more robust in the case of (forever) incomplete data sets.
There is a growing trend in data science focusing on causal inference, that is, trying to answer the question: why did this happen? That goes further than pure correlation (e.g. people who smoke die sooner) and instead figures out what is actually happening so we can act on it (e.g. the smoke from cigarettes contains all sorts of things that get stuck in the lungs, potentially degenerating into cancer: from this you can safely conclude that stopping smoking helps avoid a premature ending).
There are two issues with establishing causal inference purely on statistical grounds: robustness and interpretability.
The robustness issue is another manifestation of the point raised in the previous section: you can’t safely generalise based on your data alone if you don’t have an idea of the mechanism behind your observations.
As for interpretability, decision making always requires focusing on the right drivers of value and risk.
In some sense, theory is a variable selection tool of a kind that raw statistical techniques don't aim to be.
Take for example LASSO and other regularisation techniques: it is somewhat well known that the variables they select generally depend on the particular subset of the data they are fitted to. So much for interpretability! And it goes without saying that deep learning techniques, which are essentially high-performing prediction black boxes, are even worse in that respect.
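A minimal sketch of this instability, using synthetic data and scikit-learn (the setup and numbers are invented): when two predictors are highly correlated, LASSO tends to pick one of them somewhat arbitrarily, and which one it picks can change with the subset of data used for fitting.

```python
# Sketch: LASSO's selected variables can shift with the data subset.
# Synthetic data; assumes NumPy and scikit-learn are installed.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Make feature 1 nearly identical to feature 0: LASSO will tend to
# keep one of the pair and zero out the other.
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)
y = X[:, 0] + rng.normal(scale=0.5, size=n)

def selected(rows):
    """Indices of features with non-zero LASSO coefficients."""
    model = Lasso(alpha=0.1).fit(X[rows], y[rows])
    return set(np.flatnonzero(model.coef_))

half_a = selected(np.arange(0, n // 2))
half_b = selected(np.arange(n // 2, n))
print(half_a, half_b)  # the "important variables" may differ per half
```

If the selected set is what you hand to a decision maker as "the drivers", this sensitivity to the sample is exactly the interpretability problem described above.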
In some cases, one might be able to use A/B testing to isolate the causal relationship.
However, without theory, this method will suffer from so-called interference bias in many cases, notably for platforms.
To understand this, let’s imagine that Deliveroo were to set up some sort of priority delivery.
In this case, because Deliveroo is a multi-sided platform, linking customers, restaurants and delivery people, one can’t just go ahead in the usual way and test on a representative subset of the participants.
The issue here is that the experiment is changing the competitive landscape between the control group (where the priority delivery feature isn’t activated) and the treatment group (where consumers can choose priority delivery).
As a result, theory (here supply and demand analysis, coupled potentially with game theory or industrial organisation) is needed to guide the setup of the experiment to take externalities into account and neutralise any unwanted effects.
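One way to see the problem is a deliberately stylised simulation (all numbers and the setup are invented; this is not how any real platform works). Couriers are a shared resource, so giving the treatment group priority drains delivery capacity from the control group, and the naive A/B comparison overstates the benefit of the feature.

```python
# Stylised model of interference bias on a delivery platform.
# Both experimental arms draw on the SAME pool of couriers.

ORDERS = 120    # orders per period, split evenly between the two arms
CAPACITY = 100  # courier slots per period, shared by both arms

def unserved_share(treated_orders: int) -> tuple[float, float]:
    """Fraction of orders left waiting in (treatment, control)
    when priority orders claim courier slots first."""
    served_t = min(treated_orders, CAPACITY)
    remaining = CAPACITY - served_t
    control_orders = ORDERS - treated_orders
    served_c = min(control_orders, remaining)
    wait_t = (treated_orders - served_t) / treated_orders if treated_orders else 0.0
    wait_c = (control_orders - served_c) / control_orders
    return wait_t, wait_c

# Without the experiment, everyone is rationed equally:
baseline = (ORDERS - CAPACITY) / ORDERS       # about 0.17

# 50/50 A/B test with priority delivery in the treatment arm:
wait_t, wait_c = unserved_share(ORDERS // 2)  # (0.0, about 0.33)

# The naive estimate (wait_c - wait_t, about 0.33) is not the true
# effect: the control arm got WORSE than baseline because the
# treatment arm took its couriers.
```

The experiment here doesn't measure "priority vs no priority"; it measures "priority vs being crowded out by priority", which is exactly the externality that supply and demand analysis would flag before the test is run.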
Don't miss out

To sum it up, relying on data alone, in an atheoretic way, is not possible if one is interested in decision making.
If you’re still doubtful despite the above discussion, bear in mind that it’s not just me saying this but econometricians like Koopmans, or computer scientists like Peter Norvig or Judea Pearl.
And before them, this approach traces its roots to philosophers like Kant. Correlation is a wonderful thing, but (to paraphrase Eurythmics) don't let yourself be abused!