Data Types for Data Sciences
Alain Chabrier, Mar 19
Big Data and Data Science are now on everyone’s mind.
But not everyone clearly understands that not all data is the same, or has a clear vision of the types of applications and technologies that Data Science offers.
Data Science, Artificial Intelligence and Machine Learning are often treated as more or less equivalent.
It is critical to understand that not all data is the same in order to understand that not all data science techniques are equivalent.
This slide is the main slide of my presentation at Big Data Corp in Paris this month.
This kind of conference is full of people who have been, for years, creating software for companies to better manage their data in order to better manage their business.
When it comes to Artificial Intelligence or Machine Learning, which are important buzzwords nowadays, I feel there is some confusion and it seems they are all considered more or less as equivalent.
I hope then to clarify that different types of data exist, with different needs, which might benefit from different types of science.
As the amount of data has increased very significantly, we now talk about Big Data.
We don’t want to just manage data, store it, and move it from one place to another; we want to use it, do clever things with it, and apply scientific methods to it.
This is Data Science.
In short, Data Science “uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms”.
The definition makes this clear: there are different types of methods, processes and algorithms.
One of the Data Science techniques that works best with abundant data is Machine Learning, as it uses data to extract knowledge.
Other Data Science techniques, such as Decision Optimization, are less data-hungry, as they are based on domain knowledge.
With just one data set and a formulation of the business problem, you can start using Decision Optimization.
On the other hand, Machine Learning is addicted to data.
More data, more learning, better outcomes.
Artificial Intelligence seems to be the new most important buzzword, but in fact it is pure vintage.
According to Wikipedia, it is simply “intelligence demonstrated by machine”.
The idea of machines being intelligent like humans has been around for years.
We are now more modest and consider Artificial Intelligence to be whatever a machine does to automate a non-physical process in place of a human (in fact, we have also lowered our expectations about human intelligence).
Some of us have been doing some kind of Artificial Intelligence for years, with Rules automation or with Decision Optimization.
The good thing about this hype around AI is that my kids now think I have a fun job, while for years they thought my work was just boring mathematics.
Types of data
Before looking at which science can be beneficial for a problem, I need to look at what types of data are involved.
Let’s go over the four types I have highlighted.
I will illustrate using examples from a typical, well-known supply chain problem, where I want to plan how many items to produce in my plants, how much to stock in my warehouses and how much to deliver to my stores.
Known Data
First there is what I call “known data”.
This is data I am sure of, or at least can take as given.
In a supply chain operations problem, the topology of the chain is given, and the production and storage capacities are known at the different nodes.
How many items customers will want to buy is not known data.
What prices will be used by competition is not known data.
How many items I want to produce and stock is not known data.
The territory of known data corresponds to Descriptive Analytics.
I will extract this data from operational systems, organize it, explore it, and display it.
This data is very valuable.
If I don’t know the properties of my production plants or my warehouses, I cannot do any serious planning.
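As a rough sketch of what “organize and explore” can mean for known data, here is a minimal Python example that sums the known capacity per kind of node. The node names, the record layout and the capacity figures are all made up for illustration; real known data would come from operational systems.

```python
from collections import defaultdict

# Hypothetical known data: one record per node of the supply chain.
# Names and capacities are invented for this illustration.
nodes = [
    {"name": "plant_A", "kind": "plant", "capacity": 500},
    {"name": "plant_B", "kind": "plant", "capacity": 300},
    {"name": "warehouse_1", "kind": "warehouse", "capacity": 400},
    {"name": "warehouse_2", "kind": "warehouse", "capacity": 250},
]


def capacity_by_kind(records):
    """Sum the known capacity for each kind of node."""
    totals = defaultdict(int)
    for record in records:
        totals[record["kind"]] += record["capacity"]
    return dict(totals)


summary = capacity_by_kind(nodes)
for kind, total in sorted(summary.items()):
    print(f"{kind}: {total}")
```

Nothing here is predicted or decided: the code only reorganizes and displays facts that are already given, which is what Descriptive Analytics is about.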
Not all data is “as valuable”: there is a notion of “good data”, and it is not the same as “more data is better”.
If the data is not structured, it will be harder to use.
Because good data is valuable, it can be bought.
Not all data is open source, and capturing and reselling good data is a great business.
Unknown Data
The structure we can find in known data is in fact additional data.
This is data we did not know initially and that we can extract from the known data.
We can classify, we can structure, we can forecast.
In my Supply Chain example, based on lots of known historical data, I might predict how much demand I will have for my different stores or markets in the next month.
This is the territory of Predictive Analytics.
As said before, this area is fed with known data.
Predictions and classifications do not come out of a crystal ball, but are extrapolated from historical data.
Without known data, you cannot extract any unknown data.
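To make the idea of extrapolating from historical data concrete, here is a minimal sketch that fits a straight line to past monthly demand by ordinary least squares and extends it by one month. The demand figures are invented, and a straight line is only the simplest possible predictive model; it is an assumption for illustration, not a recommendation for real forecasting.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical known data: demand observed in one store over six months.
months = [1, 2, 3, 4, 5, 6]
demand = [100, 112, 119, 131, 138, 150]

slope, intercept = fit_line(months, demand)

# The "unknown data": demand for month 7, extrapolated from the history.
forecast_month_7 = slope * 7 + intercept
print(f"forecast for month 7: {forecast_month_7:.1f}")
```

With an empty or single-point history the fit is undefined (the division by `sxx` fails), which mirrors the point above: without known data, there is nothing to extract.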
Predictions are like an additional dimension that you cannot see in your known data but that already exists, and that you can see with the right glasses.
And the predictions will depend on the quality and variety of the known data you have.
This is the problem of bias that everyone is talking about.
Others’ decisions
The next area of data is the data someone else will set.
In my Supply Chain example, an important data set relates to the competition: where and what my competitors will sell, and at which price, is their decision, not mine.
In some cases, there is some overlap with the previous area.
When many decisions are taken in a market with given characteristics, we can expect the outcomes to follow trends that a predictive model can extract.
But in some other cases, this can really be an individual decision from someone, who may have a strategy and take unpredictable decisions.
For example, one competitor can decide to open one new store next month with special offers.
This will seriously impact my sales plans.
While the news is full of stories of companies focusing on Artificial Intelligence that lets computers play and beat humans at games, this area is not, IMHO, the most important one in practice for industrial, transportation, supply chain, production and similar problems.
Games are most of the time multi-step.
Without multiple steps, using an unpredictable strategy does not make sense.
If I do something different, it is because I expect to confuse my adversary with something he could not predict, and/or to make him react in some way.
This is the area of Game Theory.
If I consider only the average adversary, I will assume predictable reactions and will most likely use predictable strategies myself, so this case belongs to the previous area.
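The difference between these two views can be sketched with a tiny, single-step example. The payoff table, the plan names and the probability below are all hypothetical. The first function treats the competitor as an adversary and picks the plan with the best worst case (a maximin choice, one basic Game Theory idea); the second treats the competitor as an “average adversary” whose move has a predicted probability.

```python
# Hypothetical payoff table: my profit for each (my plan, competitor move).
payoffs = {
    "high_stock": {"competitor_opens_store": 40, "competitor_stays_put": 100},
    "low_stock": {"competitor_opens_store": 60, "competitor_stays_put": 70},
}


def maximin(table):
    """Return the plan whose worst-case payoff is the highest."""
    return max(table, key=lambda plan: min(table[plan].values()))


def best_expected(table, probabilities):
    """Return the plan with the highest payoff averaged over predicted moves."""
    def expected(plan):
        return sum(probabilities[move] * value
                   for move, value in table[plan].items())
    return max(table, key=expected)


# Adversarial view: guard against the worst move the competitor can make.
cautious_plan = maximin(payoffs)

# "Average adversary" view: assume a predicted 20% chance of a new store.
predicted = {"competitor_opens_store": 0.2, "competitor_stays_put": 0.8}
average_plan = best_expected(payoffs, predicted)

print(cautious_plan, average_plan)
```

With these made-up numbers the two views disagree, which is why it matters whether someone else’s decision is treated as a prediction or as a strategic move.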
Your decisions
Finally there is the set of data corresponding to your decisions.
I make my decisions in line with my rules and my objectives.
This is the area of Prescriptive Analytics.
For my Supply Chain example, I will decide how much I want to produce in my plants, how much I want to stock in my warehouses, and how much I want to deliver to my stores in addition to the price I want to set.
There are different ways to make decisions, and while Machine Learning can in some cases be very powerful for prescribing what to do, this area is still, as of today, the kingdom of Decision Optimization.
With Decision Optimization, a mathematical engine is fed with a description of my business (rules and objectives) and with a particular case (the current situation of my system, some forecasts for unknown data), and the engine deduces the optimal set of decisions for me.
Decision Optimization is a knowledge-based technique.
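Here is a toy sketch of that split between rules, objective and particular case. To stay self-contained it uses plain exhaustive enumeration in Python as a stand-in for a real Decision Optimization engine, which would use mathematical programming and scale far beyond this. The capacity, demand forecasts and margins are invented for illustration.

```python
from itertools import product

# Particular case (hypothetical numbers): known data plus a demand forecast.
PLANT_CAPACITY = 100
FORECAST_DEMAND = {"store_A": 60, "store_B": 70}
UNIT_MARGIN = {"store_A": 5, "store_B": 4}


def best_deliveries(capacity, demand, margin):
    """Enumerate every integer delivery plan and keep the most profitable.

    Rules: deliver no more than each store's forecast demand, and no more
    in total than the plant capacity. Objective: maximize total margin.
    """
    stores = sorted(demand)
    best_plan, best_profit = None, float("-inf")
    for quantities in product(*(range(demand[s] + 1) for s in stores)):
        if sum(quantities) > capacity:
            continue  # violates the capacity rule
        profit = sum(q * margin[s] for q, s in zip(quantities, stores))
        if profit > best_profit:
            best_plan, best_profit = dict(zip(stores, quantities)), profit
    return best_plan, best_profit


plan, profit = best_deliveries(PLANT_CAPACITY, FORECAST_DEMAND, UNIT_MARGIN)
print(plan, profit)
```

Note that a single data set and a formulation of the problem are enough to get a decision; no historical data is needed, apart from whatever produced the demand forecast.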
Note that in practice, real life problems do include data from all these categories, but while known data is always a very significant proportion of the data, the amount of data for the other categories can change from one problem to another.
This is why it is important to clearly understand these types of data to be able to select which type of data science technique to use.
Types of Data Science
The areas I have highlighted do not each correspond precisely to one technique from data science.
The areas correspond to types of processes we want to perform on the data; they correspond to intentions.
I focus here on the two that I consider most important, and where the most confusion lies.
On one hand, the area of Unknown Data corresponds to Predictive Analytics, where the intention is to predict unknown information (data or structure of data) from the known data. Different techniques exist, from well-known predictive models using regression techniques to more recent machine learning and neural networks.
On the other hand, the area of my decisions corresponds to Prescriptive Analytics.
Here again, this is not linked to one and only one data science technique, but to one intention: prescribe the next best actions to take for a current situation.
In some cases Machine Learning might be the best technology, but in most cases Decision Optimization (known in the past as Operations Research or Mathematical Programming) is still technically the dominant player in this area.
This is because Decision Optimization has direct impact on everyday decisions: it tells you what to do when faced with a choice of thousands or millions of possibilities.
It’s an expert advisor for your decisions.
This isn’t covered much today in newspapers.
In another post, I use a common everyday situation to introduce the different existing techniques.
Conclusions
So my conclusion is that we should be careful and not directly link data and data science to artificial intelligence and machine learning.
There are different types of data to consider when we face a complex problem with lots of data.
For different types of data, there are different operations we might want to execute, and while we want to apply Artificial Intelligence and Data Science, we should consider different types of data science and different technologies, using the ones that best fit the data and the intention we have in mind.
For a complex problem, two or three types of data are involved, and we might need to use and combine two or three different types of data science techniques.
This is why a platform such as Watson Studio where different types of tools, including Decision Optimization, are available, will help you handle these problems.
See this post to understand how Decision Optimization integrates into Watson Studio.