Lots of the necessary steps early in a data-science project involve closing off dead-ends: seeing what doesn’t work, and how it doesn’t work, so we can get a better idea of how to find what does work.
We don’t expect a high success rate in the early stages, at least if ‘success’ is defined as the kind of actionable, conclusive prediction that the sponsors are eagerly awaiting.
No-one wants to take ownership of an experiment that isn’t certain to succeed, so every negative or ambiguous result gets attributed to the data-scientist.
This wouldn’t be much of an issue were it not for the fact that, for plenty of outcomes, the reason for failure, or a better possibility for success, looks obvious in hindsight.
Taking measured risks is always necessary to narrow the possibility-space, but discovering the empty parts of it is never seen as sexy, and can even come across as stupid.

The directions are hopelessly broad

Most of the really useless directions I hear sound crisp and clear-cut to the business-person who speaks them.
Things like ‘optimise returns’, ‘reduce costs’, ‘increase efficiency’ are good starting points for a data-science project in the same way that a film studio’s logo and jingle are a good starting point for a movie.
They just tell you that you’re starting a movie.
They tell you nothing, absolutely nothing, about the movie.
I guess it is true that most machine-learning algorithms are essentially programmed to do something like ‘try everything and bring back the optimal result’.
What annoys data-scientists no end is when the managers around them seem to assume that data-science as a whole process works like that, and doesn’t even require some kind of definition of what an optimal result actually is.
I’ve often got the sense that I’m being discouraged from relying on subject matter experts for directions to help narrow the focus of the project, or even identify a suitable label in the data to train supervised algorithms on.
There’s some pervasive fear that letting experts into the process too much will pollute the purity and objectivity of a data-science project, or that picking a target somehow narrows down the options for what could be valuable.
It sounds a bit like:
“We don’t want to bias the results by focussing on what the experts know. Let’s not cut any data out, keep options open… Just come back and tell me what the data is telling us.”
There’s no way to plumb the depths of frustration that comments like this cause a data-scientist.
Real-world data has a freaky tendency to evade or defy every boundary or expectation that’s put on it.
The term ‘wrangling’ captures nicely the effort we exert trying to get it into a contained, manageable shape.
No matter how much you’ve cleaned and prepared the data, there are always surprises: in some way or other it escapes and becomes unwieldy in some other aspect or dimension.
And no matter how clear and specific a question you ask it, you never quite get a clear answer that doesn’t also immediately pose some other pressing question.
Start with a very specific goal; the data will push back, you’ll move to a better goal, and eventually you’ll find a good one.
Leaving everything on the table and trying to narrow down never works.
This idea that keeping all the options open, and all the data in scope, using all possible methods, will somehow produce an incredibly valuable, powerful insight that no-one has ever thought of before is what I call the ‘Aladdin’s lamp’ approach to data science.
It’s comically stupid, but remarkably common.
Data-scientists get hired to rub, while executives wait for the genie to emerge and give them a song-and-dance before they come up with their three wishes.
If they’d thought carefully beforehand about what their three wishes would be, and what having them granted would look like, they’d realise how ridiculous all the lamp-rubbing really was.
They might also find that the data-scientist actually knows how to grant some of the wishes, using science, rather than magic.

Everything is expected to be clear and certain

Don’t get me wrong, there’s nothing wrong with clarity, and aiming for it.
There are certain stages of a data-science project where the data and the label being targeted are well-defined and understood, and clarity might be good, and improving.
But they’re late stages, and most projects fail before they even get there.
For the bulk of a data-science project, which can (and arguably should) feel like ‘discovery’ most of the time, a degree of comfort with inherent, unavoidable uncertainty is required.
One of the prominent figures in the Australian data-science scene, Eugene Dubossarsky, once described statistics as being “the rigorous treatment of uncertainty”.
I think that’s a great phrase.
Treating uncertainty with rigour is definitely not the same as removing the uncertainty and replacing it with a definitive relationship.
Plenty of requests from senior managers fail to appreciate that there’s a trade-off between how easily understandable a result is to them, and how much rigour it comes with.
The more of the c-suite c-words you impose on the way the result is presented (clear, conclusive, concise, etc.), the more violence we have to do to all the r-words (rigorous, reliable, robust, repeatable).
Sure, we can hack our way through the nuance and complexity to a clear, concise result, but it involves a string of assumptions and approximations which may or may not be appropriate.
If the executives took enough time to understand them properly, they’d be appalled and shocked and change tack immediately.
Instead, they shirk all the ‘technicalities’ by saying something like:
“It’s not my job to understand how a result was produced, that’s your problem as the technical expert. I just need to know what the result is, so I can explain why it’s important to the business.”
Sure enough, later in the project when some result is found not to be useful or reliable, and some necessary arbitrary assumption that was used to reach a premature level of clarity is exposed, the data-scientist cops the blame.
I regularly get requests to produce fewer of the interesting plots with colours, gradients, and legends, and to use more rankings or scored lists instead.
“Sure, I can reduce the entire complex analysis to a single dimension which has only one possible interpretation.
But why on earth would you want to do that?” Naturally, they want to announce a clear discovery to all the stakeholders.
Or maybe they want to point to the top of the list and tell the data-scientist to focus all their efforts there (with that chuffed look that expects my compliments on how skilfully they’re now directing the project).
Data-scientists produce plots precisely because they illustrate some important relationship that cannot be reduced to a single dimension.
And yes, plotting the data is still data-science, if it helps inform a business decision, including how we should best direct efforts in the next sprint.
If data-scientists could co-ordinate industrial action, they would refuse to produce a model using a neural network or gradient-boosted decision trees for anyone who won’t take the time to understand an important scatter-plot.
If we could do that, productivity would sky-rocket, and data-scientists would be much happier with the directions that were given, and spend less time looking for a better job.
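To make the point about scatter-plots concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the two segments, their names, and the spend/return numbers are hypothetical, not data from any real project. It scores each segment by its mean return (the kind of single number that ends up in a ranked list) and then looks at the relationship a scatter-plot would have shown.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: (spend, return) observations for two made-up segments.
segments = {
    "segment_a": ([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]),   # return rises with spend
    "segment_b": ([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]),   # return falls with spend
}

# The 'scored list' view: one number per segment, sorted best-first.
# The two scores tie, so the order just reflects insertion order.
ranking = sorted(segments, key=lambda name: mean(segments[name][1]), reverse=True)

# The view a scatter-plot would give: the score plus the spend/return relationship.
summary = {
    name: (mean(ys), pearson(xs, ys)) for name, (xs, ys) in segments.items()
}

for name in ranking:
    score, corr = summary[name]
    print(f"{name}: mean return = {score}, spend/return correlation = {corr:+.2f}")
```

On the ranked list the two segments are indistinguishable. On a scatter-plot they tell opposite stories about what spending more would do, which is exactly the part that matters for deciding where to direct the next sprint.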

How to do better?

This could and should occupy many more posts, but a few takeaways are:
Make sure one person owns the sprint plan.
Not a committee, and not the data-scientist.
Ideally the person who owns the problem that you’re trying to solve owns the solution, and the process of building it.
Define your goals and performance metrics as precisely as you can to begin with, but be prepared to change them (a lot) as the data throws you curve-balls, which it will.
Trying to prove or measure something that you think should be trivial and obvious isn’t a bad place to start.
Then set more interesting targets, sprint by sprint, as the data reveals them.
A proper agile workflow is essential.
In data-science projects, the technical and business questions are necessarily deeply intertwined.
Don’t try to separate them.
Any good data-scientist will be hungry for more involvement from subject-matter experts, not less, and this wish should be granted.
Senior people can and should be involved, including as the owner, but if they’re not technically well-versed, they should either budget a lot of time for sprint reviews and planning, so that the technical people can bring them along on the journey, or delegate ownership to someone else.
Asking for ‘top-line’ discussion to be separated from the technical discussion throughout the project always ends in tears.