From my experience most databases are:Not organizedNot easily accessedNot easily manipulatedNot easily updatedWhen we talk about doing data science.
In older years (like 20 lol) it was easier to maintain a database because the data was simple, smaller and slower.
Nowadays we can save almost whatever we want in a “database”, and that definition I think is stuck with another concept, the relational database.
In a relational database we have a set of “formally” described tables from which data can be accessed or reassembled in many different ways without having to reorganize the database tables.
Basically we have schemas in where we can store different tables, and inside of those tables we have a set of columns and rows, and inside of an specific position (row and column) we have an observation.
We also have a relationship between those tables.
But they’re not the most important thing, the data they contain is the most important thing.
Normally they are pictured like this:https://towardsdatascience.
com/what-if-i-told-you-database-indexes-could-be-learned-6cf8f59bff94What is a graph database?https://www.
com/enterprise-it/software/graph-technology-data-standby-every-fortune-500-company/Based upon the concept of a mathematical graph, a graph database contains a collection of nodes and edges.
A node represents an object, and an edge represents the connection or relationship between two objects.
Each node in a graph database is identified by a unique identifier that expresses key value pairs.
Additionally, each edge is defined by a unique identifier that details a starting or ending node, along with a set of properties.
I’ll use an example from people at Cambidge Semantics to illustrate how a a graph database works.
Imagine we have some data that’s stored in a local restaurant chain.
Normally in a relational database you’d store customer information in one database table, the items you offer in another and the sales that you’ve made in a third table.
This is fine when I want to understand what I sold, order inventory and who my best customer is.
But what’s missing is the connective tissue, the connection between the items, along with functions in the database that can let me make the most of it.
A graph database stores the same sort of data, but is also able to store linkages between the things.
John buys a lot of Pepsi, Jack is married to Valerie and buys different drinks.
I don’t have to run JOINs to understand how I should market to each individual customer.
I can see the relationships in the data without having to make a hypothesis and test it.
The people from neo4j mention that:Accessing nodes and relationships in a native graph database is an efficient, constant-time operation and allows you to quickly traverse millions of connections per second per core.
Whereas relational databases store highly-structured data in tables with pre-determined columns and rows, graph databases can map multiple types of relational and complex data.
Thus, graph databases are not rigid in their organization and structure, as relational databases are.
All relationships are natively stored within the vertices of the edges, meaning that the vertices and edges can each have properties associated with them.
This structure allows for a database that can depict complex relationships between unrelated data sets.
Uses of graph databasesDid you know that 2018 was touted as “The Year of the Graph”, as more and more organizations both large and small have recently begun to invest in graph database technology.
So we aren’t on a crazy path here.
I’m not saying that everything we know from relational databases, and SQL will not work anymore.
I’m saying that there are some cases (surprisingly a lot of them) where you are better using a graph database than a relational database.
I’m going to give you right now an idea on when you should be using a graph database instead of something else:You have highly related data.
You need a flexible schema.
You want to have a structure and build queries that are more similar to way people think.
Instead if you have a highly structured data, you want to do a lot of grouping calculations and you don’t have that many relationships between your tables, then you may be better with a relational database.
A graph database has another, not obvious advantage.
It allows you to build a knowledge-graph.
Because they are graphs, knowledge-graphs are more intuitive.
People don’t think in tables, but they do immediately understand graphs.
When you draw the structure of a knowledge graph on a whiteboard, it is obvious what it means to most people.
And then you can start thinking on building a data fabric, which then can allow you to re-think the way you do machine learning and data science as a whole.
But that’s material for a next article.
Implementing a graph database in your companyLike traditional RDBMS, graph databases can be either transactional or analytical.
Choose your focus when you choose your graph database.
For example, the popular Neo4J, Neptune or JanusGraph are focused on transactional (OLTP) graph databases.
While something like AnzoGraph is an analytical (OLAP) graph database.
However, be careful, you may need a different engine for running quick queries that touch upon single entities (e.
What soda does Sean buy?) and analytical queries that poll the whole database.
What is the average price for a soda paid by people like Sean?).
Graph OLAP databases are becoming very important as Machine Learning and AI grows since a number of Machine Learning algorithms are inherently graph algorithms and are more efficient to run on a graph OLAP database vs.
running them on a RDBMS.
Here you can find great resources for different types of graph databases and computing tools:jbmusso/awesome-graphA curated list of resources for graph databases and graph computing tools – jbmusso/awesome-graphgithub.
comThe use cases for graph OLAP databases are vast.
For example one can find key opinion leaders and book recommenders using PageRank algorithm.
Also, conducting churn analysis to improve customer retention or even doing machine learning analysis to identify the top five factors that are driving books sales.
If you want to see more on why and how to implement a graph OLAP take a look at this article:If You're Not Using Graph OLAP, Here's Why You ShouldMore and more businesses and government agencies have started using GOLTP systems to store and drill down on their…blog.
comWhat’s next?The following charts (taken from https://db-engines.
com/) show the historical trend of the categories’ popularity.
In the ranking of each month the best three systems per category are chosen and the average of their ranking scores is calculated.
In order to allow comparisons, the initial value is normalized to 100.
As the data sources continue to rapidly expand (with unstructured data growing the fastest of all), finding machine-based insights is becoming more and more important.
Graph databases provide an excellent infrastructure to link diverse data.
With easy expression of entities and relationships between data, graph databases make it easier for programmers, users and machines to understand the data and find insights.
This deeper level of understanding is vital for successful machine learning initiatives, where context-based machine learning is becoming important for feature engineering, machine-based reasoning and inferencing.
In the future I’ll discuss about how graph databases can help us do machine learning and data science in general.
Related articles:Ontology and Data ScienceHow the study of what there is can help us be better data scientists.
comThe Data Fabric for Machine Learning.
How the new advances in semantics can help us be better at Machine Learning.
comDeep Learning for the Masses (… and The Semantic Layer)Deep learning is everywhere right now, in your watch, in your televisor, your phone, and in someway the platform you…towardsdatascience.
com.. More details