Data is not the new oil

About the reality of working with data

Samuel Flender, Feb 10

If you work in data science or a related field, you have probably heard this quote before: “Data is the new oil.”
The quote goes back to 2006 and is credited to mathematician Clive Humby, but it has recently picked up more steam after The Economist published a 2017 report titled “The world’s most valuable resource is no longer oil, but data”.
Photo credit: Zbynek Burival, Unsplash

This all sounds pretty exciting.
But is it really true?

Equating data to oil might make sense at first glance, given the data-driven success of tech companies, but the analogy breaks down as soon as you dig a little deeper (pun intended).
The thing with oil is, once an oil company finds it in the ground, it knows more or less which steps to follow to turn that oil into profits: drill, extract, refine, sell.
This is far from the reality you face when dealing with data, where it is rarely clear how exactly to turn that data into profits.
Handling data

If you run a business, and you want to do anything with your data, the first thing you need to do is create the infrastructure required to store and query that data.
Data does not live in spreadsheets.
Let’s assume that you run a travel booking website.
Data is generated anytime someone searches, books a trip, clicks on an ad, or interacts in any other way with the content on the website.
In order to capture all of that data, you need to hire data engineers, and set up something like a Hadoop cluster that allows resilient data storage and rapid querying.
This is a big investment you would need to make up front.
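To make the capture step a little more concrete, here is a minimal sketch of what a single logged interaction might look like before it is shipped to storage. The `Event` class, its field names, and the `to_log_line` helper are hypothetical and chosen for illustration only. A real pipeline would define its own schema and write to whatever storage system the data engineers set up.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


def _utc_now() -> str:
    """Current time as an ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class Event:
    """One user interaction on the (hypothetical) travel booking website."""
    user_id: str
    event_type: str  # assumed values: "search", "booking", "ad_click"
    payload: dict    # free-form details, e.g. {"destination": "Hawaii"}
    timestamp: str = field(default_factory=_utc_now)


def to_log_line(event: Event) -> str:
    """Serialize an event as one JSON line, ready to append to a log."""
    return json.dumps(asdict(event), sort_keys=True)


# Example: a user searches for trips to Hawaii.
line = to_log_line(Event("u123", "search", {"destination": "Hawaii"}))
```

Every search, booking, and ad click would produce one such line, which gives a sense of how quickly the volume grows.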
Making sense of data

Let’s assume you did all that.
The database would tell you which user clicked on what and when, which flights and hotels they booked, where they are traveling to, and when they are going.
Maybe you even have user profiles that contain demographic information: where users live, what age they are, and so on.
An incredibly rich dataset.
But what do you do with all that data?

Well, you need to hire data scientists to figure out ways to turn that data into business insights, which in turn might be leveraged for profits.
That way, you will have a chance to get a return on your investment.
Enter data science

Data is incredibly noisy.
Data scientists are trained to make sense of noisy data by asking the following questions:

1. What hypothesis can I make about the process by which the data is generated?
2. How can I test that hypothesis against the data?
3. What insights can I deduce from my hypothesis test?

Notice how the workflow starts with an idea about the business process, and only then goes to the data. This is because the data is too noisy to provide intrinsic value.
The data scientist’s workflow rarely starts with the data itself.
Photo credit: Abigail Lynn, Unsplash

An example

Here’s an example of what a data scientist might do with the above-mentioned travel booking website data.
My hypothesis is that users are more likely to click on third-party ads if they are somehow personalized towards their travel preferences.
For example, I could show ads for surf schools to millennials who previously booked trips to Hawaii.
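As a rough illustration, the targeting rule in this example could be written as a small function. This is only a sketch: the `User` fields, the `pick_ad` function, and the ad label `"surf_school"` are made up for illustration, and the birth-year range used for "millennial" (1981 to 1996) is an assumption rather than something the business would necessarily agree on.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: "millennial" means born 1981-1996 (inclusive).
MILLENNIAL_BIRTH_YEARS = range(1981, 1997)


@dataclass
class User:
    """Hypothetical user profile built from booking and demographic data."""
    user_id: str
    birth_year: int
    past_destinations: List[str] = field(default_factory=list)


def pick_ad(user: User) -> Optional[str]:
    """Return a personalized ad label, or None to fall back to a generic ad."""
    is_millennial = user.birth_year in MILLENNIAL_BIRTH_YEARS
    went_to_hawaii = "Hawaii" in user.past_destinations
    if is_millennial and went_to_hawaii:
        return "surf_school"
    return None


# Example: a user born in 1990 who previously booked a trip to Hawaii.
ad = pick_ad(User("u123", 1990, ["Hawaii", "Lisbon"]))
```

The rule itself is trivial. The hard part is everything around it: getting reliable profile data, serving the ads at scale, and finding out whether the rule actually helps.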
How would I test my hypothesis?

First, I would need to implement such a personalized ad system at scale, which would require a significant amount of engineering work.
Second, once I have such a system up and running, I would need to perform something like an A/B test to figure out if the new system actually gives me an increase in ad click rates.
Only if that A/B test is successful have I demonstrated a way for the business to generate profits with the data.
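One common way to evaluate such an A/B test is a two-proportion z-test on the click rates of the two groups. The sketch below uses only the Python standard library. The function name and the click and view counts in the example are invented for illustration and are not real results. It also assumes a one-sided test (does the personalized system do better?) and sample sizes large enough for the normal approximation to hold.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    clicks_a: int, views_a: int, clicks_b: int, views_b: int
) -> Tuple[float, float]:
    """Compare click rates of control (A) and treatment (B).

    Returns the z statistic and the one-sided p-value for the
    alternative "B has a higher click rate than A".
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled click rate under the null hypothesis of no difference.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if std_err == 0:
        raise ValueError("click rates are degenerate (all or no clicks)")
    z = (rate_b - rate_a) / std_err
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts: generic ads (A) vs. personalized ads (B).
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.4f}")
```

A small p-value would suggest that the higher click rate in group B is unlikely to be due to chance alone. Whether that lift is large enough to pay for the engineering effort is a separate business question.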
Are there other ways the travel booking website could profit from the user data?

There are probably dozens of ways, but you need to invest time, research, human resources (data scientists, data engineers, software engineers), and technological resources to find them.
This process involves many unknowns and a lot of research and experimentation, and is therefore fundamentally different from the process of extracting, refining, and selling oil, where the steps are much clearer.
Photo credit: NASA, Unsplash

Data and creepiness

Finally, when you build any system that leverages your data assets, such as a personalization system, you have to be extremely careful not to be too creepy.
Creepiness includes any form of personalization that feels too intrusive.
A well-known example of creepy targeting (pun intended) is Target’s pregnancy detection model.
Target came up with the following hypothesis: Pregnant women are more likely to buy certain items such as prenatal vitamins and maternity clothing.
Target’s thinking went like this: if we can build an algorithm to find these customers, and send them coupons for baby products, such as diapers, they are more likely to make Target their one-stop grocery store once the baby has arrived.
As the New York Times reported, a year after Target started using this model, a man walked into a Target store outside Minneapolis, angrily demanding to speak to the manager.
Here is what he said:

“My daughter got this in the mail! She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”

As it turned out, his daughter was pregnant, and Target knew before he did.
Data-driven personalization gone creepily wrong.
Netflix has learned the importance of personalization. (Photo credit: Charles, Unsplash)

Summary: data is not the new oil

The process of working with data is messy, requires careful planning, engineering, and research, and contains a lot of unknowns and pitfalls.
Most importantly, it is not always clear how to leverage the data, since data by itself is too noisy to provide value.
Equating data to oil neglects this messy and complicated reality.
That being said, one of the most powerful applications of data that we see today is arguably personalization, which drives the success of tech companies such as Amazon, Facebook, Google, Netflix, Spotify, and many more.
These companies have learned that personalized products are more successful than generic products.
Others, such as the banking or insurance industries, are learning, too: we live in a world where more and more digital products are becoming personalized.
Finally, the reality about oil is that both its supply and its use cases are finite.
The reality with data is the opposite: as long as there are humans around, we will always create more data.