Everything your grandmother wanted to know about the world of data and never dared to ask

Jaime Durán, Apr 12

Data Science, Big Data, Data Lakes, Artificial Intelligence, Data Mining, Machine Learning, Deep Learning, Business Intelligence, Business Analytics… A lot has been written and said about these in recent years.
And as usual, when something becomes fashionable, the concepts, definitions and limits gradually fade away.
In this article I try to bring some order, explaining what is what in a simple way.
If any of the terms above are not clear to you… stay with me!

Available in Spanish too

Let’s see if I understand it… [Danh Vo on Unsplash]

If you have an account on LinkedIn and use it regularly, you may have noticed it’s becoming more and more complicated to know what the hell a person does just by reading their headline.
We’ve filled everything with flamboyant technical words and acronyms that we may be tired of seeing, yet have surely never bothered to look up.
AHA…

“But you… what do you do exactly?”

And this is when the fun begins…

If the question is asked by someone who works in the same role, you can explain it in a lot of detail (or you should be able to).
If it’s asked by someone from the same sector, you may give an explanation assuming the other person knows all the technologies you work with, and you’ll probably fail.
If the question comes from someone who works in a different field, that’s when you’ll really be put to the test.
Especially if the other person starts asking you questions…

What if you had to explain it to your grandmother?

And this is where I wanted to go.
There is a “famous” quote (wrongly attributed to Albert Einstein) that says:

“You don’t really understand something until you can explain it to your grandmother”

Another variant uses a 6-year-old instead of the grandmother, but since we are talking about technology you’ll allow me to go with the first one.
A couple of years ago, my colleague Antonio Calderone came up with the idea of applying this to explain everything we were doing at the Digital Transformation Unit to the commercial managers of our company, and to enable them to retell it in turn (the game of broken telephone).
I had to speak about Big Data, Business Analytics and a little about Machine Learning.
The truth is that it wasn’t easy for me… although I want to think they got the basic ideas.
Well, I’ll repeat the same exercise here, clarifying the broader concepts related to the world of data, and trying to make it suitable for all audiences :)

Data Science (DS)

A simple definition: Data Science is the set of skills and techniques applied to extract useful knowledge from data.
This set of skills is often represented with a Venn diagram created by Drew Conway (or one of its variants):

We have three circles representing three distinct fields.
On the one hand, we have the field of programming (knowledge of a language, its libraries, design patterns, architecture, etc.).
On the other hand, there are mathematics (algebra, calculus, …) and statistics.
Last but not least, there is the domain of the data (knowledge of the particular sector: health, finance, industry, etc.).
These fields come together, giving rise to the skills and techniques from the definition.
Here we have things like getting the data, cleaning it, analyzing it, forming hypotheses, algorithms, machine learning, optimization, visualizations to present results, and much more.
Data Science brings together these fields and skills, enabling and improving processes for the extraction of insights and knowledge from raw data.
[Image: gapingvoid]

And what’s “useful knowledge”? Knowledge that adds some kind of value by answering a question or solving a real-world problem.
Data Science could also be defined as the field that studies and applies progress in the treatment and analysis of data, to give us solutions and answers.
Big Data

This is going to be the easiest one: Big Data is simply a huge amount of data, and nothing else :)

Multiply data and you’ll get more ponies!

Big Data is commonly explained through the 3 V’s, the 3 main factors behind its origin:

Volume: The amount of data collected grows absurdly fast every minute, so we need to adapt our storage and processing tools to that volume, using distributed solutions (multiple machines instead of one very, VERY expensive supercomputer or mainframe).

Velocity: The urgency with which data must be processed is linked to how frequently it is generated or acquired, and to the need to use it in decision-making as quickly as possible, even in real time (or almost).

Variety: The data is no longer (only) structured, so we have to forget the idea that everything fits in a traditional database. We must be prepared to add new data sources in all kinds of formats, ranging from plain text to multimedia content.
As time passed, more V’s were added: veracity (the data must be authentic, credible, and available), value (the data must have value for the business or for society) and vulnerability (the data must comply with legality, respect privacy, and be stored and accessed in a safe way).
Big Data would be the set of solutions trying to address all these problems.
Do not confuse it with the first concept explained in this article: Big Data is everything that enables or facilitates the application of advances in the field of Data Science, when the nature of the data demands it.
Example: we, as data scientists, are trying to get answers from a dataset, which not only exceeds the size of our RAM, but also exceeds the size of our hard drive.
Big Data provides us with distributed storage technologies to host data across several machines, and also distributed processing technologies to handle them in parallel.
Data Lake

A Data Lake is a centralized storage repository, used to store data of all kinds: structured (the data we used to put in perfectly defined tables), semi-structured (data that follows a format where almost everything fits: CSV, logs, JSON, XML, etc.), and unstructured (documents, e-mails, PDFs, images, video, audio, etc.).
It doesn’t matter whether the data is generated inside or outside our business.
Being “centralized” implies that everything is stored in the same place, and everyone goes there to obtain data.
This doesn’t imply that all the data is on the same machine or within the company; distributed storage will be used almost as a rule, and the data could also be in the cloud.
Do not overlook a crucial detail: data is stored in raw format (the original one), without any modification.
This implies that no information is lost for any future analysis; data will only be processed and transformed when it’s used.
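A minimal sketch of that "transform only when used" idea, assuming raw events kept as one JSON document per line. The records, field names and reader functions are hypothetical; the point is only that each analysis applies its own parsing at read time, and the stored data never changes.

```python
import json

# Hypothetical raw events, stored exactly as they arrived.
raw_events = [
    '{"user": "ana", "amount": "12.50", "currency": "EUR"}',
    '{"user": "luis", "amount": "7", "note": "no currency recorded"}',
]

def read_amounts(lines):
    """Parse and transform the raw records only at read time."""
    for line in lines:
        record = json.loads(line)
        yield record["user"], float(record["amount"])

def read_currencies(lines):
    """A later analysis can still use fields the first one ignored."""
    for line in lines:
        record = json.loads(line)
        yield record.get("currency", "unknown")

amounts = list(read_amounts(raw_events))
currencies = list(read_currencies(raw_events))
print(amounts)     # [('ana', 12.5), ('luis', 7.0)]
print(currencies)  # ['EUR', 'unknown']
```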
Besides that… what would be the point of cooking the fish before putting them in a lake? :)

Artificial Intelligence (AI)

“Can machines think?”

In 1950, Alan Turing posed this question, and even devised a famous test to evaluate whether the answers given by a machine were similar to those a human could give.
Since then, people have fantasized about artificial intelligence, with a focus on imitating human behavior.

Did you ever take that test yourself?

Oh, wait! My intention was not to tell you the history of Artificial Intelligence… Let’s return to the concept itself.
Artificial Intelligence isn’t Blade Runner’s replicants, or Battlestar Galactica’s Cylons.
We can define an artificial intelligence as any machine or software with some kind of intelligent behavior.
And what is considered intelligent behavior?

Good question! This is the point where we don’t agree.
As machines develop new capabilities, tasks previously considered intelligent are taken out of the AI domain.
For example, when the amazing Deep Blue defeated Garry Kasparov in a chess match, and its creators explained how it really worked, the poor girl went from being the smartest one to being labeled “dumb” (albeit with great brute force, that’s true).

And to top it off, ugly

Let’s define artificial intelligence as any machine or piece of software capable of correctly interpreting data from its environment, learning from it, and using the acquired knowledge to carry out a specific task within a changing context.
Examples: A car that parks by itself is not considered intelligent; it simply measures distances and moves following a routine.
A car able to drive autonomously is considered intelligent, since it’s capable of making decisions based on what happens around it (in a totally uncertain environment).
The field of Artificial Intelligence encompasses several branches, which are currently booming.
It helps to visualize them to know exactly what we are talking about:

Data Mining

Data Mining is the art of finding interesting (and non-obvious) patterns using data exploration techniques.
What patterns are we referring to? Things like: the way data can be grouped based on certain features, anomaly detection (infrequent values), the dependence between some observations and others, sequences of certain events, the identification of behaviors, etc.
Data Mining uses, among other things, Machine Learning methods.
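To make one of those patterns concrete, here is a small sketch of anomaly detection. It uses a plain statistical rule (distance from the mean in standard deviations) rather than a Machine Learning method, and the purchase amounts and the threshold of 2 are assumptions chosen for illustration.

```python
from statistics import mean, stdev

# Hypothetical daily purchase amounts for one customer.
amounts = [21.0, 19.5, 22.3, 20.1, 18.9, 95.0, 21.7, 20.4]

def find_anomalies(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

anomalies = find_anomalies(amounts)
print(anomalies)  # [95.0]
```

On real data a single extreme value inflates the standard deviation itself, so more robust measures (such as the median) are often preferred; this version is kept deliberately simple.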
Machine Learning (ML)

Machine Learning is the most important branch of Artificial Intelligence.
Its task: the research and development of techniques that enable machines to learn by themselves how to execute a specific task, without explicit instructions from humans.
The machine will learn from an input data set (known as sample or training data), building a mathematical model based on the patterns detected by an algorithm.
The ultimate goal of this model is to make (accurate) predictions or decisions on the data arriving afterwards from the same sources.
Within classical Machine Learning, there are two main types:

Supervised learning: when the training data is “labeled”. This means that, for each sample, we have the values of the observed variables (the inputs) and of the variable we want to learn to predict or classify (the output, target, or dependent variable). Within this type we find regression algorithms (those predicting a numerical value) and classification algorithms (when the output is limited to certain categorical values).

Unsupervised learning: when the training data is not labeled (we don’t have a target variable). The goal here is to find some kind of structure or pattern, for example by grouping the training samples so that future samples can be assigned to one of those groups.
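The difference between the two types can be sketched with toy data. Everything below is an assumption made for illustration: the measurements, the labels, and the choice of a 1-nearest-neighbor classifier and a two-cluster k-means as representative algorithms.

```python
# Supervised: every training sample has inputs AND a known label.
# Hypothetical data: (height_cm, weight_kg) -> species label.
labeled = [
    ((25.0, 4.0), "cat"),
    ((23.0, 3.5), "cat"),
    ((60.0, 30.0), "dog"),
    ((55.0, 25.0), "dog"),
]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_label(sample):
    """1-nearest-neighbor: copy the label of the closest training sample."""
    _, label = min(labeled, key=lambda pair: squared_distance(pair[0], sample))
    return label

# Unsupervised: the same kind of inputs, but no labels at all.
unlabeled = [(25.0, 4.0), (23.0, 3.5), (60.0, 30.0), (55.0, 25.0)]

def two_means(points, iterations=10):
    """Tiny k-means with k=2: alternate assignment and centroid update."""
    centroids = [points[0], points[-1]]  # naive initialization, fine for a sketch
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            nearest = min((0, 1), key=lambda i: squared_distance(p, centroids[i]))
            groups[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return groups

print(predict_label((58.0, 29.0)))  # dog
groups = two_means(unlabeled)
print(groups)
```

The supervised function answers with a name ("dog") because it was given names to learn from. The unsupervised one can only say which samples belong together; deciding what each group means is left to us.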
Classic Machine Learning has given way to more sophisticated or modern approaches:

Ensemble Methods: basically, the joint use of several algorithms to obtain better results by combining their outputs. The most common example is Random Forests, although XGBoost has become very famous because of its victories on Kaggle.

Reinforcement Learning: the machine learns by trial and error, thanks to the feedback it gets in response to its interactions with the surrounding environment. You may have heard about AlphaGo (world’s best Go player) or AlphaStar (capable of crushing us in StarCraft II).

Deep Learning: the crown jewel…

Deep Learning (DL)

As we’ve just seen, Deep Learning is a sub-field within Machine Learning.
Cats again…

It’s based on the use of artificial neural networks.
An artificial neural network is a computational model, with a layered structure, formed by interconnected nodes that work together.
They are named for their inspiration in (or their attempt to simulate) the biological neural networks found in our brains.
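The layered structure can be sketched in a few lines. The weights and biases below are hand-picked, hypothetical numbers (a real network would learn them from training data), and the sigmoid activation is just one common choice.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """One layer: each node computes a weighted sum of all inputs, then an activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

# Hypothetical, hand-picked parameters (a trained network would learn these).
hidden_weights = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]  # 3 nodes, 2 inputs each
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.7, -0.5, 0.2]]  # 1 node, 3 inputs
output_biases = [0.05]

def network_forward(inputs):
    """Pass the inputs through the hidden layer, then the output layer."""
    hidden = layer_forward(inputs, hidden_weights, hidden_biases)
    return layer_forward(hidden, output_weights, output_biases)

output = network_forward([1.0, 2.0])
print(output)  # a single value between 0 and 1
```

"Deep" networks follow the same pattern, only with many more layers and nodes stacked between input and output.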
Although neural networks have been studied and used for many years, advances within the field were very slow until recently, limited mainly by the lack of computational power.
Deep Learning didn’t really get started until the past decade, but it has boomed in recent years, thanks in part to the adoption of GPUs for training neural networks.
There is a widespread belief out there: any Machine Learning problem, however complicated, can be solved by a neural network, simply by making it big enough.
Nowadays, many advances are being made in the other fields of Artificial Intelligence because of the progress in Deep Learning, both in the more traditional ones (improving the results obtained) and in the trendiest ones: natural language processing, computer vision, speech recognition, generation of realistic multimedia content, etc.
Business Intelligence (BI)

This term refers to the use of data within a company to help its managers in decision-making.
BI tools (reports, dashboards) tell us what happened, and decisions based on that will therefore be reactive.
A random dashboard

Business Analytics (BA)

It’s the evolution of traditional Business Intelligence, taking advantage of the advances in Big Data that enable companies to explore and interact with a greater amount of data, of any kind, and coming from more sources; all of this (almost) in real time.
It also makes use of improvements in the field of Data Science, so discoveries made from the data will be much more valuable.
BA tools tell us what happened and what is happening; but they also predict what will happen, and even simulate what might happen, depending on the actions we take.
Decisions taken may therefore be proactive, rather than reactive.
The idea behind BA is that the whole company can benefit from these discoveries, leading to better (and faster) decisions in all areas.
And that’s all! I hope everything is clearer now… isn’t it? :)

Time to stop using “Big Data” for everything.