Data science productionization: trust
It’s entirely reasonable for consumers of data science to ask why they should trust the analytic results they are given.
Schaun Wheeler
Mar 25
This is the final part of a five-part series on data science productionization.
The rest of the series can be found at the following links:
What does it mean to “productionize” data science?
Portability
Maintenance
Scale
Trust
I devote most of the posts in this series to the more technological aspects of productionization, although even those aspects are heavily dependent upon some very human processes.
But let’s say our code is all packaged, containerized, and version-controlled; that our workflows have all been automated; that all of the processes have technical and non-technical documentation; and that there are no major resource shortages or problematic integrations.
In other words, we’ve achieved technical productionization.
We’re not done.
Productionization, I believe, entails all the steps needed to ensure that the people who use your product are able to derive the full benefit from it.
Most companies live below their means: they get only a small portion of the total value of their available technology because they don’t use everything they have — in fact, they often don’t know they have it in the first place.
Machine learning models are often opaque — in many cases, even the person who builds and trains the model can have a hard time understanding why particular results turned out the way they did.
Moreover, data science is often used to allow computers to make decisions that humans formerly made.
Given those conditions, it’s natural that trust should be a major issue.
In cases of supervised learning, where a model is trained against a ground-truth dataset, there are a variety of metrics that tell you how well the model performed — how far off the predictions tend to be on average, or what percentage of records were mis-classified, and so forth.
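As a rough illustration of the two kinds of metrics just mentioned, here is a minimal sketch in plain Python. The function names and all of the numbers are made up for the example; they are not results from any model discussed in this series.

```python
# Illustrative sketch only: the data below is invented for the example.

def mean_absolute_error(actual, predicted):
    """How far off the predictions are on average, ignoring direction."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def misclassification_rate(actual_labels, predicted_labels):
    """Share of records whose predicted label differs from the true label."""
    wrong = sum(a != p for a, p in zip(actual_labels, predicted_labels))
    return wrong / len(actual_labels)


# Regression-style example: the absolute errors are 1, 0, 2 and 4.
actual = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 12.0, 13.0, 24.0]
mae = mean_absolute_error(actual, predicted)

# Classification-style example: one of the five labels is wrong.
labels = ["work", "work", "home", "home", "work"]
guesses = ["work", "home", "home", "home", "work"]
rate = misclassification_rate(labels, guesses)

print(f"mean absolute error: {mae}")
print(f"misclassification rate: {rate}")
```

Both functions return a single number, which is exactly the limitation: the number summarizes performance but says nothing about why a given prediction came out the way it did.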
For non-technical consumers of data science, however, those metrics are just more numbers — they don’t mean much.
What usually helps consumers trust a model is the ability to tell a story about how the model works.
Part of that story, the part that explains how the model works in general, gets told in non-technical documentation.
But the other part of the story tells about the specific results the model obtained in the particular use-case to which it was applied.
I tend to break trust priorities down into three categories: interpretation, visualization, and ethics.
Interpretation
An easy and usually ineffective way to tell a model’s story is to show model coefficients (in cases of simple linear models like logistic regression) or importance scores (in cases of more complex models like random forests).
These numbers technically allow a user to understand which parts of the model were important, but that raises the question: why should we trust that the model chose the right features as important? Remember, trust is the central issue at stake.
If we haven’t given users a reason to trust the model, we haven’t explained it.
I’ve written previously about how, in my current work, we use mobile location signals and how, among other uses, we map those signals to parcels of land.
We had a need to decide whether a device’s visit to a parcel indicated that that parcel was the device owner’s place of work.
I won’t go into the details of the algorithm we developed — suffice it to say it yielded a score between 0 (definitely not a place of work) and 100 (definitely a place of work).
Users wouldn’t have gotten full use from the product if we had just classified everything above 50 as “work” and everything below as “not work”.
Some clients wanted volume and didn’t find false positives to be very costly, so a 50% cutoff would have been too stringent.
At the same time, those clients didn’t want to throw money away by targeting locations that obviously weren’t work locations.
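One way to picture the alternative to a single fixed cutoff is to let each client choose where on the 0 to 100 scale “work” begins. The sketch below assumes that setup; the client names and cutoff values are hypothetical, since the cutoffs clients actually used are not stated here. The four scores are the example scores from the heatmaps discussed below.

```python
# Hypothetical per-client cutoffs on the 0-100 work score.
# These names and values are assumptions made for illustration.
CLIENT_CUTOFFS = {
    "volume_focused": 30,     # tolerates more false positives
    "precision_focused": 70,  # wants only clear workplaces
}


def label_visits(scores, cutoff):
    """Label each 0-100 score as 'work' or 'not work' for a given cutoff."""
    return ["work" if score >= cutoff else "not work" for score in scores]


scores = [91, 54, 24, 3]

for client, cutoff in CLIENT_CUTOFFS.items():
    print(client, label_visits(scores, cutoff))
```

With the lower cutoff, the scores of 91 and 54 both count as work; with the higher cutoff, only 91 does. The scores stay the same; what changes is how much risk the client is willing to accept.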
And then there was a perhaps more fundamental problem: what does it really mean for a parcel to have a 50% probability of being a work location? What does that actually look like?
The heatmaps here show when an individual device tended to visit four different parcels over a month-long period.
The score we gave to each relationship is above each graph (the big long number is just a unique identifier for the parcel).
For each graph, each column is a day of the week, Monday through Sunday; each row is an hour of the day.
You can easily see that a score of 91 is very obviously a workplace: the user is there primarily Monday through Friday, from about 9:00 in the morning to about 6:00 in the evening.
On the opposite end, a score of 3 is pretty clearly not a workplace — the user is only there nights and weekends.
A score of 54 is kind of “worky”.
The visitation patterns are primarily at the right days and times, but coverage is kind of spotty.
By the time we get down to a score of 24, the pattern looks at best kind of random, and at worst looks vaguely like a residence.
These interpretation guidelines provided customers with the knowledge they needed to use the tool correctly, but also the confidence to feel they were using the tool correctly.
Both were necessary to get the tool adopted.
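The heatmaps described above are, underneath, just counts of visits by hour of day and day of week. A minimal sketch of building such a grid from visit timestamps follows. The timestamps are invented, the function name is hypothetical, and this is only the counting step behind the picture, not the scoring algorithm itself.

```python
from datetime import datetime


def visit_grid(timestamps):
    """Count visits in a 24 x 7 grid.

    Rows are hours of the day (0-23); columns are days of the week,
    Monday (0) through Sunday (6), matching the layout of the heatmaps.
    """
    grid = [[0] * 7 for _ in range(24)]
    for ts in timestamps:
        grid[ts.hour][ts.weekday()] += 1
    return grid


# Invented visits for one device at one parcel.
visits = [
    datetime(2019, 3, 4, 9, 15),   # Monday morning
    datetime(2019, 3, 4, 14, 0),   # Monday afternoon
    datetime(2019, 3, 5, 9, 30),   # Tuesday morning
    datetime(2019, 3, 9, 22, 0),   # Saturday night
]

grid = visit_grid(visits)
print(grid[9])  # visit counts for the 9:00 hour, Monday through Sunday
```

A grid like this can be handed to any plotting library's heatmap function; the point is that the picture a customer sees maps directly onto something they already understand, a weekly schedule.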
Visualization and UX
Often, the best explanation is the one customers discover themselves.
A good visualization — ideally, an interactive one — allows the user to have a conversation with the model.
In other words, it allows the model, to some extent, to tell its own story.
That doesn’t mean the visualization has to be aesthetically very professional looking.
In fact, some of the most beautiful visualizations I’ve seen had very little informational value.
When we’re talking about building trust, a good visualization walks a user through the process of making a data-driven decision.
If a user can experience the process of going from not having a decision ready to having one, they will trust the results more, because most people have an inherent trust in their own judgement.
I worked for a time at an asset management startup where we were trying to make recommendations for potential investors in emerging economies.
We had data on historical product category consumption for many different countries.
We built a model that predicted consumption for each product category in each country for five years in the future.
The idea was that, by predicting the future (given, of course, all of the uncertainty involved in that prediction), our clients could make decisions about where they wanted to invest.
But the salient question for our clients wasn’t “will the country’s per-capita spend on chocolate be at a certain level in five years?” It wasn’t even “will the spend grow by a certain amount over five years?” It was “if I invest now, can I expect the growth in spend to be a good combination of rapidity and longevity?” If a market is going to grow exponentially for five years and then suddenly crater, it may not be a good investment.
And a market that will have little-to-no growth is clearly a poor investment.
To help our clients make these decisions, I created a simple interactive visualization that I called a fuse plot.
Each country has a fuse.
The skinny part of the fuse connects the per-capita spend from five years ago to the current spend: it’s how far the country has traveled in five years.
In other words, it’s the part of the fuse that has already burnt.
The thick part of the fuse is the part still left to burn: the distance from current spend to the anticipated spend five years from now.
The good investments are those where relatively little of the fuse has burnt out.
That changes our view of the investment opportunities.
There are some markets where spend has grown dramatically but we don’t expect it to last.
There are others where spend is relatively low and slow, but we expect it to speed up.
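The arithmetic behind a fuse can be sketched in a few lines. This is an illustration under assumptions, not the code behind the original plot: the class name, the “burnt share” summary, the country names, and all spend figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fuse:
    country: str
    past_spend: float      # per-capita spend five years ago
    current_spend: float   # per-capita spend today
    forecast_spend: float  # predicted per-capita spend five years from now

    def burnt_share(self):
        """Fraction of the whole fuse that has already burnt, or None."""
        burnt = self.current_spend - self.past_spend          # skinny part
        remaining = self.forecast_spend - self.current_spend  # thick part
        if burnt < 0 or remaining < 0 or burnt + remaining == 0:
            return None  # shrinking or flat markets do not fit the picture
        return burnt / (burnt + remaining)


# Invented figures for two hypothetical markets.
fuses = [
    Fuse("Country A", past_spend=10.0, current_spend=18.0, forecast_spend=20.0),
    Fuse("Country B", past_spend=10.0, current_spend=12.0, forecast_spend=20.0),
]

# Markets with the least fuse burnt come first.
ranked = sorted(
    (f for f in fuses if f.burnt_share() is not None),
    key=lambda f: f.burnt_share(),
)
for f in ranked:
    print(f.country, f.burnt_share())
```

Both hypothetical markets start and end at the same spend, but Country A has already used up most of its growth while Country B has most of it still ahead. That difference is what the plot makes visible at a glance.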
The visualization gives the customer the ability to enter the model and explore the results.
They need less explicit explanation of how the model works because they can intuitively discover those things themselves.
Good visualization often requires more customization than simply selecting a particular kind of canned plot.
Visualizations deserve to be designed just as well as any other part of the process.
Ethics
Ethics is often treated as a set of values or principles.
That’s not what I’m talking about here.
The purpose of visualizations and interpretation metrics is to build trust regarding the aspects of the model that users know they should distrust: “the model says I should spend my money on X. Why should I believe that?” or “the model says that Y was the most important impact upon my operations. Why should I believe that?” In the context of productionization, ethics refers to tools you build to help users see the parts of the model that they didn’t know to question, but should.
An investigation by Bloomberg into Amazon’s offering of same-day delivery service revealed that poorer, more minority-populated zip codes were frequently excluded.
It appears that the algorithm recommended same-day delivery in places where Amazon already had many Prime subscribers, and those subscribers tend to be fewer in number in poorer areas.
Source: https://www.
com/graphics/2016-amazon-same-day/
A fully productionized system has many parts that are both automated and modular.
This makes it very easy to introduce system changes that have unintended second- or third-order consequences.
Responsible productionization requires ethical safeguards, which over the long term protect both the integrity of the system and the well-being of those impacted by the system.
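One possible form such a safeguard could take is an automated check that compares how often a recommendation is made across groups of areas, and raises a flag when the gap is large. The sketch below is a hypothetical illustration: the group labels, the data, and the 0.2 threshold are all assumptions, and it does not describe how any company or news investigation actually performed its analysis.

```python
def coverage_by_group(areas):
    """Share of areas that receive the service, computed per group.

    `areas` is a list of (group, is_served) pairs.
    """
    totals, served = {}, {}
    for group, is_served in areas:
        totals[group] = totals.get(group, 0) + 1
        served[group] = served.get(group, 0) + int(is_served)
    return {group: served[group] / totals[group] for group in totals}


def flag_disparity(rates, max_gap=0.2):
    """True when the best and worst served groups differ by more than max_gap.

    The default max_gap is an arbitrary value chosen for the example.
    """
    return max(rates.values()) - min(rates.values()) > max_gap


# Invented data: eight areas in two hypothetical groups.
areas = (
    [("higher_income", True)] * 3
    + [("higher_income", False)]
    + [("lower_income", True)]
    + [("lower_income", False)] * 3
)

rates = coverage_by_group(areas)
needs_review = flag_disparity(rates)
print(rates, needs_review)
```

A check like this does not decide whether a disparity is acceptable. It only makes sure that a human sees it before the system ships, which is the part that automated, modular pipelines tend to skip.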
Ethical growth is smart growth.
Systems that do not have ethical safeguards are systems at risk of catastrophic failure of trust.
If you have a system in place to flag and meaningfully address ethical issues as they arise, then when a major ethical problem really does occur, you already have a baseline level of trust and goodwill among your users.
If, however, the ethical failure is allowed to grow undetected before finally blowing up in your face, it’s very difficult (and usually expensive) to recover from that.
An article in the Washington Post, written by the owner of a restaurant, gave a good illustration of how ethics can be incorporated into the productionization process.
Employees used a color system to classify uncomfortable customer behavior as yellow (bad vibe), orange (offensive undertones), or red (overt harassment or repeated orange incidents).
All a staff member had to do was report the color — “I have an orange at table five” — and the manager took action, no questions asked.
A red led to the customer being asked to leave.
An orange caused the manager to take over the table.
A yellow meant the manager took over if the employee wanted.
Local ethical regulation of algorithm development doesn’t have to be any more complicated than the simple workplace example above.
If an employee feels uncomfortable about the implications of an analytic product, he or she should be moved to another project (of equal pay and prestige) if desired and there should be a team review of the issue.
If an employee can point to specific ethical concerns, he or she should be moved to another project and the team review should be required.
If an employee has a specific, strong concern, the project should be put on hold pending review.
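The three tiers above are simple enough to write down as a lookup table, which is one way to make the policy explicit rather than leaving it to case-by-case judgment. The sketch below is a hypothetical encoding: the names are invented, and the reassignment and review settings for the strongest tier are an assumption, since the text only says the project goes on hold.

```python
from dataclasses import dataclass
from enum import Enum


class Concern(Enum):
    UNCOMFORTABLE = "uncomfortable"  # general unease about implications
    SPECIFIC = "specific"            # can point to specific ethical concerns
    STRONG = "strong"                # specific and strong concern


@dataclass(frozen=True)
class Response:
    reassign: str        # "on request" or "yes"
    team_review: str     # "recommended" or "required"
    hold_project: bool   # whether work stops pending review


POLICY = {
    Concern.UNCOMFORTABLE: Response("on request", "recommended", False),
    Concern.SPECIFIC: Response("yes", "required", False),
    # Assumption: the strongest tier keeps the reassignment and review
    # rules of the tier below it and adds the project hold.
    Concern.STRONG: Response("yes", "required", True),
}


def respond(concern):
    """Look up the agreed response; no justification is asked of the reporter."""
    return POLICY[concern]


print(respond(Concern.STRONG))
```

As with the restaurant's color system, the value is that the person raising the concern only has to name the level. What happens next is already decided.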
Yes, setting up and abiding by that kind of system would be inconvenient.
It’s a lot less inconvenient than a lawsuit or a press fiasco.
Even from a cold, hard business perspective, ethics is an essential component of smart risk management.
Trust is a must-have, not a nice-to-have
In previous sections of this series, I’ve focused on documentation, automation, integration, and other more-or-less technical ways to ensure that data science makes a meaningful impact within whatever business or organization deploys it.
All the technical sophistication in the world can’t save a system that operates at a trust deficit.
Data science exists to do things that humans historically have done so humans can spend their time doing other things.
If humans don’t trust that capability to do its job — to do their job — then they won’t let it.
They’ll find ways to ignore it or undermine it.
That will cause the system to fail, and to fail at a much greater cost than what we would see from a shoddy technical implementation.
In the end, the “data” in data science is a placeholder for humans — both the people who generate the data and the people who leverage it.
Data science doesn’t work well or last if it is not built in a way that partners with people.
Partnership requires trust.