Visualizing SQL: A Beginners Guide to Relational DatabasesOliver KnockleinBlockedUnblockFollowFollowingJun 14Convolutional Neural Networks, Random Forests, Support Vector Machines, and other cool machine learning models tend to be the bait that lures many scholars, such as myself, into the field of Data Science.
We come with big dreams: to create stock trading algorithms, image classification for autonomous vehicles, and a whole string of other grandiose plans to improve the world through data science.
However, the models are only as good as the data it is based on, which is why being able to extract the desired data will always be a skill which every data scientist, analyst, or statistician should be comfortable with.
This article will act as a short guide to the design of segmented databases, provide a quick and dirty introduction to the theory behind such databases, as well as providing some exposure to Relational Schemas and Entity-Relationship Diagrams.
The problem with traditional methods of data storage, such as Microsoft Excel is two-fold.
The simplest problem is that it is not designed to handle the quantity of information needed by most companies.
Excel would crash immediately if one tried to view a table with millions of rows.
Secondly, there is a real problem with allowing multiple users to access the information simultaneously.
Doing so on Excel would not only slow down the computational power for everyone, but inevitably people would edit over each other.
To get around these problems, the relational database was designed.
Relational Databases try to get around the problems listed above by segmenting the data into smaller, more focused subsets or tables.
This way one can select only the segments of data which one is interested in.
This saves computation time, and also allows multiple users to access the information without crashing the system.
The only issue then is: how to retrieve the desired information from the bunch of segmented tables?The undisputed king of relational database management is SQL, or one of its children such as MySQL.
The actual syntax used by SQL, and all of its derived dialects such as MySQL, is pretty intuitive, resembling broken, but understandable English.
It is therefore not usually a problem writing the code to get the data you want, once you know where it is.
Therein lies a problem though.
Since the datasets are so segmented, locating the desired data, which can be scattered across a multitude of tables often proves much more confusing than the actual coding.
Because of that, this article will not focus on SQL coding, but rather on devices which will allow you to visualize the layout of the data.
The first of these tools is known as an Entity-Relationship (ER) Diagram.
An ER diagram will display the various tables as squares (called entities), with their columns (attributes) displayed as ovals, branching off of the entities in which they are located.
Each attribute must therefore be connected to an entity.
Each entity must have at least one unique attribute, which takes on a unique value in each entity instance.
The unique values can be identified in an ER diagram since they will be underlined in the ovals.
These unique values can then be used to recombine the tables to extract the values you want.
A diamond line then shows the relationship between two entities.
Below we can see a simple ER diagram.
Here we can see that the data consists of two entities: the Student, and the College.
Each of these entities contains multiple different attributes which describe it.
Each entity has a unique key, identified by the underlining of its attribute name.
The different entities can be combined into a larger table by looking at the linking feature, in this case it would read student attends college.
The linking feature also provides some really useful information about the nature of the relationship.
On either side of the diamond we can see first one vertical line, followed by a kind of triangle, or a second vertical lie.
Those show first the minimum, and then the maximum number of occurrences in linking the two entities.
In this case it is telling us that every college has at least one student, and at most multiple students attending.
We also see that every student attends at least one, and at most one college.
Neat right?Of course real datasets are going to be a lot more complex than the simple example, but ER diagrams can still be a useful tool to visualize the nature of the data.
The second method of visualizing the structure of relational databases is through the use of Relational Schemas.
A relational schema will be less visually appealing than an ER diagram, but it provides a more accurate representation of how the data is actually structured.
A relational schema has three important components: tables, primary keys, and foreign keys.
Unlike ER diagrams, the attribute names are just listed in a table, with each table representing an entity.
Each table will contain a primary key whose value is unique for every row in a table.
The names for primary keys will be bolded and underlined.
The next important concept is what is known as a foreign key, which is a column that refers to the primary key of a different table.
Foreign keys are used to link two distinct tables, and are identifiable by a (FK) next to their name in a table.
The various tables can then be linked using foreign and primary keys.
Relational schemas will display arrows between two tables, showing how they are linked, with the arrow pointing from the foreign key in one table to its corresponding primary key in a different table.
Let’s take another look at the student-college example from above.
We can see that the relational schema displays almost the same information as the ER diagram above.
While not as visually appealing, the relational schema allows a more concise representation of the data, which is important when dealing with massive datasets.
Of course, with larger and more complex sets, there will be more tables, with more rows, but the linking mechanism will always be the same.
Relational Schemas and ER diagrams are two of the most popular methods of visualizing the design of relational databases.
Understanding the design of databases might seem dull and far removed from the glamor of building cool models, but without understanding how the database is structured, you will be unable to extract the data you want using SQL, which will mean your model will be unable to be built!.Because of this, I would highly recommend using websites such as ERDPlus.
com to create ER diagrams and relational schemas if they are not provided in the dataset.
Once you have the layout of the database, you can move on to actual SQL queries.
For a more detailed introduction to SQL, I would recommend the coursera course titled Managing Big Data with MySQL.
It does an excellent job at introducing the concepts used by SQL, as well as providing loads of practice resources.
.. More details