Big Data: The Art & Science That is Revolutionizing the World

Dive Deep Into the Technology that Will Drastically Improve Every Aspect of Our Lives

Victor Roman
Jun 5

Big Data: The Why, How and What

In recent years, information technologies have experienced a drastic boom:

- Sensors are now ridiculously cheap.
- Computational power has increased tremendously.
- There are internet-connected devices everywhere (smartphones, activity bracelets, smartwatches, TVs, cars… Even Thermomixes have Wi-Fi now!)

These factors and others have caused a drastic increase in available data.
Today, we are able to produce, store and send more data than ever before in history.
So much so, that back in 2015, it was estimated that 90% of the available data had been created in just the previous two years.
And since then, the rate of data generation has only increased.
In fact, it is estimated that 2.5 quintillion bytes of data are currently generated every day.
That is 2.5 x 10¹⁸ bytes, or more than one hundred million times the storage capacity of a first-generation iPhone.
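The scale of that figure is easy to check with a few lines of arithmetic. The sketch below assumes that "quintillion" is read on the short scale (10¹⁸) and that the first-generation iPhone shipped in 4, 8 and 16 GB versions, counted in decimal gigabytes. The names `DAILY_BYTES` and `iphones_filled_per_day` are made up for this illustration.

```python
# Back-of-the-envelope check of the daily data volume.
# Assumption: 2.5 quintillion bytes means 2.5 * 10**18 (short scale).
DAILY_BYTES = 25 * 10**17

# Assumption: first-generation iPhone capacities, in decimal gigabytes.
IPHONE_CAPACITIES_GB = (4, 8, 16)


def iphones_filled_per_day(capacity_gb, daily_bytes=DAILY_BYTES):
    """Return how many phones of the given capacity one day of data would fill."""
    bytes_per_phone = capacity_gb * 10**9
    return daily_bytes // bytes_per_phone


for capacity in IPHONE_CAPACITIES_GB:
    phones = iphones_filled_per_day(capacity)
    print(f"{capacity} GB model: about {phones:,} phones per day")
```

Whichever capacity is used, the result stays in the hundreds of millions of phones per day.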
All this data is commonly known as Big Data.
To be considered Big Data, data must meet the following characteristics, the so-called 3 Vs rule:

- It must have high Variety.
- It must come in increasing Volume.
- It must arrive with great Velocity.

The Boom in Data Science as a Result of Big Data

These vast amounts of data, however, lead to a new problem: what do we do with it?
There is so much data available that it is impossible for humans alone to study it and extract valuable information from it.
It would take several lifetimes to analyze this data and find patterns and insights in it.
Luckily, computers are there to help us.
In addition, great advances have been made in the development of Machine Learning algorithms in recent years.
These algorithms, combined with the abundant computational power now available at very low cost, have caused the drastic expansion of the Data Sciences that we are currently experiencing.
Data is the fastest-growing driver of improved business results.
To create a competitive advantage, more and more organizations use their data to increase efficiency, sales, and marketing effectiveness.
But nowadays most data is still disconnected and underutilized.
And this is where the Data Sciences come into play to solve this problem.
The main Data Sciences in use nowadays are:

Data Engineering

Data Engineering focuses on building adequate infrastructure to facilitate the flow of data within organizations, and on preparing this data so that it is in a useful format.

Data Analytics

Data Analytics focuses on finding useful information in the data.
This branch of Data Science is involved in the descriptive and diagnostic analysis of the data, which explains what happened and why it happened.
It also involves Data Visualization (which is an entirely separate field).

Machine Learning

Machine Learning is the science (and art) that focuses on making computers learn from data.
Computers do this by learning correlations between certain characteristics of past data and the outcomes those characteristics lead to, so that when they are presented with new data, they can make accurate predictions.
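As a minimal sketch of this idea, the following fits a straight line to a handful of past observations with ordinary least squares and then predicts the outcome for an input it has not seen. The data is invented, the function names `fit_line` and `predict` are illustrative, and a linear model is just one simple choice among many. None of it comes from a particular library.

```python
def fit_line(xs, ys):
    """Learn (slope, intercept) of the least-squares line through past data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned line to a new, unseen input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical past data: hours a machine ran -> units it produced.
past_hours = [1, 2, 3, 4]
past_units = [3, 5, 7, 9]

model = fit_line(past_hours, past_units)
print("learned slope and intercept:", model)
print("predicted units for 5 hours:", predict(model, 5))
```

The "learning" here is just the computation of two numbers from past examples. Real Machine Learning models have many more parameters, but they follow the same pattern of fitting on past data and then predicting on new data.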

Deep Learning

Deep Learning is a subfield of Machine Learning that focuses on replicating the learning mechanisms that intelligent beings use.
It does this by breaking complex concepts down into simpler ones, and so learning in a hierarchical way, using Artificial Neural Networks to achieve this.
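To make the hierarchical idea concrete, here is a tiny two-layer network, sketched under strong simplifying assumptions: the weights are set by hand rather than learned, and each unit uses a hard threshold. The first layer detects two simple concepts (OR and NAND), and the second layer combines them into a more complex one (XOR), which a single unit of this kind cannot represent on its own. All names (`step`, `neuron`, `xor_network`) are illustrative.

```python
def step(z):
    """Hard-threshold activation: output 1 when the weighted sum is positive."""
    return 1 if z > 0 else 0


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs plus bias, then threshold."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor_network(a, b):
    # Layer 1: two simple concepts computed directly from the raw inputs.
    h_or = neuron((a, b), (1, 1), -0.5)      # 1 if at least one input is 1
    h_nand = neuron((a, b), (-1, -1), 1.5)   # 1 unless both inputs are 1
    # Layer 2: a more complex concept built from the simpler ones (AND of both).
    return neuron((h_or, h_nand), (1, 1), -1.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))
```

In an actual Deep Learning system the weights are not hand-picked. They are adjusted automatically from data, and the network has many more layers and units.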

Data Science Goal and Path

As stated before, data is nowadays hugely disconnected and underutilized. The ultimate goal of these sciences is to go from raw data (which has no value on its own) to wisdom, which will ultimately help decision making, as decisions will be driven by objective information.
The following picture conveys this notion quite well:

[Figure]

In summary, Data Science is a set of fundamental principles, processes and techniques for extracting knowledge from data automatically.
The ultimate goal is to improve decision making and all the tasks should be subordinated to this objective.
The Data Science path, from easiest and least valuable to hardest and most valuable, is the following:

[Figure]

Technologies to Work with Big Data

The Data Sciences described above are built around large amounts of data. When the amount is so vast that a single computer cannot analyze it and extract meaningful insights, parallelized analysis (and Big Data technologies) comes into play.
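The core pattern behind parallelized analysis can be sketched on a single machine: split the data into partitions, process each partition independently, and merge the partial results. The example below counts words this way using only the Python standard library. The worker threads merely stand in for the machines of a cluster, so the point is the split, process and merge structure rather than speed. The function names and the sample lines are invented for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Map step: count the words in one partition of the data."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, n_partitions=4):
    # Split the data set into partitions of roughly equal size.
    partitions = [lines[i::n_partitions] for i in range(n_partitions)]
    # Map: each worker processes its own partition independently.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    # Reduce: merge the partial results into a single answer.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


sample_lines = [
    "big data is big",
    "data is everywhere",
    "spark processes big data",
]
print(parallel_word_count(sample_lines))
```

Frameworks built for Big Data apply the same idea across many machines and also handle the data distribution, scheduling and failure recovery that this sketch ignores.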
The following is an example of a typical Big Data infrastructure:

[Figure]

Some of the technologies used to build and work with a Big Data ecosystem are:

- Batch: Hadoop, Hive, Apache Spark
- Streams: Apache Kafka
- Infra: Cloudera, Hortonworks, MapR
- Automation: Ansible, Chef, Jenkins, Airflow, Luigi
- Containers & Cluster Management: Docker, Mesos, DC/OS, Kubernetes, Marathon
- Language: Scala, Java, Python
- DB: SQL (Yeah you need to be very good at SQL 🙂), NoSQL DBs, time-series DBs
- Indexing: Elasticsearch
- Visualization: Kibana, Grafana

In the next series of articles we will focus on the Data Analytics and Machine Learning side of Big Data architectures. Specifically, we will work with Apache Spark and its Python API: PySpark.
So, if you want to learn more about this amazing technology and how to carry out some real-world projects, stay tuned for the next articles!