We had the ability to analyze huge amounts of data, of different types, and in real time too, which was awesome. But the problem was that even though we tried our best to govern our data and keep everything tight, it wasn't easy.
Most of our data lakes turned into data swamps (see "Data Lake vs. Data Swamp: Pushing the Analogy"). This is not uncommon. Even though there are ways to improve how we use a data lake and really govern it, it's still not easy to get the data we want, when we want it.
That's why, when I'm working with companies, the thing I hear most often is: "We have a lot of data, I just don't know where. It should be here somewhere…" Data normally sits in silos, under the control of one department and isolated from the rest of the organization, much like grain in a farm silo is closed off from outside elements. It's time to stop that. Remember: to extract value from data, it must be easy to explore, analyze and understand.
Towards the Data Fabric

If you've been following my research you may remember my definition of the data fabric: the data fabric is the platform that supports all the data in the company — how it's managed, described, combined and universally accessed. This platform is built from an enterprise knowledge graph to create a uniform and unified data environment.
There are two important points I want to make here: the data fabric is built from the enterprise knowledge graph, and it should be as automated as possible. To create a knowledge graph you need semantics and ontologies: a useful way of linking your data that uniquely identifies it and connects it to common business terms.
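To make the idea concrete, here is a minimal sketch of linking siloed datasets to common business terms with triples, the basic building block of a knowledge graph. The dataset names, terms and predicates are invented for illustration; a real fabric would use an RDF store or graph database rather than plain dictionaries.

```python
from collections import defaultdict

# Each fact is a (subject, predicate, object) triple — the same shape RDF uses.
# These example triples map physical datasets to shared business terms.
triples = [
    ("sales_db.customers", "maps_to_term", "Customer"),
    ("crm.accounts",       "maps_to_term", "Customer"),
    ("sales_db.orders",    "maps_to_term", "Order"),
    ("Order",              "belongs_to",   "Customer"),
]

# Index the triples so we can answer: "which datasets describe this term?"
by_term = defaultdict(list)
for subject, predicate, obj in triples:
    if predicate == "maps_to_term":
        by_term[obj].append(subject)

print(by_term["Customer"])
# Two different silos map to the same business term,
# so they can now be found and explored together.
```

The point of the mapping is exactly the "common business terms" idea above: once both `sales_db.customers` and `crm.accounts` are linked to the term `Customer`, a user searching for customers finds both, regardless of which silo owns them.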
The key here is that instead of looking for possible answers, under this new model we're seeking the answer. We want the facts, and where those facts come from is less important.
The concept of the data lake is important too, because we need a place to store our data, govern it and run our jobs. But we need a smart data lake: a place that understands what we have and how to use it. We have to make the effort to organize all of the organization's data in one place and really manage and govern it.
To move toward the data fabric universe we need to start thinking about ontologies, semantics, graph databases, linked data and more in order to build a knowledge graph, and then find a way of automating the process of ingesting, preparing and analyzing data.
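The ingest–prepare–analyze automation could be sketched as a pipeline of small stages. Everything here is hypothetical: the stage functions and record fields are assumptions standing in for real connectors and semantic validation, not any particular product's API.

```python
def ingest():
    # In a real fabric this stage would pull from databases, APIs and files;
    # here we hard-code two records, one of them incomplete.
    return [{"term": "Customer", "value": 10},
            {"term": "Customer", "value": None}]

def prepare(records):
    # Drop incomplete records. A real pipeline would also validate records
    # against the ontology's business terms.
    return [r for r in records if r["value"] is not None]

def analyze(records):
    # A trivial "analysis": aggregate the surviving values.
    return sum(r["value"] for r in records)

def pipeline():
    # The automation goal: each stage feeds the next with no manual steps.
    return analyze(prepare(ingest()))

print(pipeline())  # → 10
```

The design point is that once the stages are composable functions, the whole flow can be scheduled and rerun automatically, which is what "as automated as possible" asks of the fabric.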
You can read more about how to start building a data fabric here: "The Data Fabric for Machine Learning. Part 2: Building a Knowledge-Graph" (towardsdatascience.com).
Conclusion: Data Science in the Data Fabric

The ultimate goal of using data is making decisions from it. Data science does exactly that: we have data, and after the data science workflow we should be able to make decisions from the analysis and models we created.
So far I've written two pieces on how to start doing machine learning (ML) and deep learning (DL) in the data fabric, both on towardsdatascience.com: "The Data Fabric for Machine Learning," on how the new advances in semantics can help us be better at machine learning, and "The Data Fabric for Machine Learning. Part 1-b: Deep Learning on Graphs," on the basics of deep learning on graphs, which is growing in importance by the day.
Before doing that, we need to break our data silos: harmonizing organizational data is necessary to find new insights and unlock our data's full potential.
What we actually need is a graph-based system that supports data analysis, usually called graph Online Analytical Processing (graph OLAP). A graph OLAP engine (like Anzo) can deliver the high level of performance enterprises need for big data analytics at scale, and in combination with a graph Online Transaction Processing (OLTP) database (like Neo4j, Amazon Neptune, ArangoDB, etc.) it gives you a great way to start building your knowledge graph.
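As a toy illustration of the OLTP/OLAP split on a graph (this is not Anzo or Neo4j — the nodes and edges are invented), OLTP-style work touches one node at a time while OLAP-style work scans the whole graph:

```python
# A tiny adjacency-map graph: node -> {neighbor: edge_count}.
graph = {
    "Customer": {"Order": 3, "Account": 1},
    "Order":    {"Product": 5},
    "Product":  {},
    "Account":  {},
}

def neighbors(node):
    """OLTP-style operation: a fast, local read of one node's edges."""
    return sorted(graph[node])

def total_edges():
    """OLAP-style operation: an aggregate that scans the entire graph."""
    return sum(len(edges) for edges in graph.values())

print(neighbors("Customer"))  # → ['Account', 'Order']
print(total_edges())          # → 3
```

The same distinction is why the two systems are complementary: the OLTP database serves many small transactional reads and writes, while the OLAP engine runs heavy analytical queries over the whole knowledge graph.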
After you have successfully created a data fabric, you will be able to do one of the most important parts of the data science workflow, machine learning, since ML in this context is: the automatic process of discovering insights in the data fabric, using algorithms that can find those insights without being specifically programmed for them, over the data stored in the fabric.
Remember also that insights generated with the fabric are themselves new data, which becomes explicit and manifest as part of the fabric. Insights can grow the graph, potentially yielding further insights.
So the process of doing data science inside the data fabric is much easier: we have a whole system that stores data and automates its ingestion, processing and analysis, and that also enables us to find and explore all the data available in the organization in a faster and clearer way. No more weird data and huge queries just to get a simple value — that's one of the goals too.
There are examples of data fabrics all around us that we don't even know about. Most of the world's most successful companies are implementing and migrating their systems to build a data fabric, and of course everything inside of it.
I think it’s time for all of us to start building ours.
Thanks for reading this.
If you have any questions, please write me here: Favio Vazquez — Founder / Chief Data Scientist — Ciencia y Datos | LinkedIn.

Have fun learning :).