Can we stop with the SQL JOINs venn diagrams insanity?Randy AuBlockedUnblockFollowFollowingFeb 28Really, please, OMG, stopPet peeve time.
I’ve had to teach SQL to non-technical people multiple times over the years and everyone who’s ever tried to learn SQL and failed talk about how “joins are hard and scary”.
If you were to search online for explanations of SQL joins, yeah, it looks insane.
Look at this, just LOOK at it!What really gets me is that the whole mixing of JOINs and Set Theory doesn’t even make sense.
Once your explanation has to include “WHERE A is NULL” in seemingly arbitrary places, people are going to think you have to memorize arbitrary incantations to get things to work — which rightfully sounds HARD!Argh, someone make this stop!Side note: Relational AlgebraYes, I’m aware that SQL has its roots in relational algebra and not set theory.
Effectively, the way I explain things to people yields a similar result, but apparently comes at it somewhat backwards from the formal way of starting from the Cartesian and filtering down.
I don’t have a formal background in relational algebra, so instead of me mangling it hopelessly, you’re welcome to view other sources about the topic.
How I go about teaching JOINs to peopleI explain things as a step by step pseudo-algorithm about how SQL engines yield JOINed data sets, since I feel it lets you grasp some of the more crazy things you can do with a JOIN.
PostgreSQL compatible SQL to generate and test these queries are at the endLet’s say we have two tables, A and B:Data about cities godzilla attacks adapted from hereSuppose we want to do arbitrarily join A to B to get “City names in table A with Godzilla attacks in B”.
(Yes, the tables have plenty of data issues.
For simplicity I didn’t bother with IDs and normalization so I had to use City_name as the joining key.
But all that doesn’t matter in teaching how joins function so I’m ignoring it.
)My goal here is to teach someone who has only seen SQL syntax for just 15 minutes prior to learn to accurately predict what a JOIN will result in consistently.
The Conceptual JOIN AlgorithmFor the generic query:Select _fields_FROM AJOIN B ON on_conditionsWHERE where_conditionsI use the idea of a “working set” where rows that qualify for a JOIN are notionally stored in order to be checked against WHERE conditions later.
For every row in A, compare it to every row in B, one by oneCheck the on_conditions — If on_conditions is TRUE, concatenate the row from A to the row in B and place the concatenated A+B row into a working set — If on_conditions is FALSE, move on to comparing the next pair of rowsFor LEFT/RIGHT/OUTER JOINs — For LEFT JOINs: If for a row in A, no entry has ever been placed in the working set, despite checking every row in B, then place that row from A into the working set, use NULLs in fields that should have had data from B— For RIGHT JOINs: similar to left joins, but for B, if a row in B has no entry in the working set after comparing against all rows of A, insert the row from B into the working set, use NULLs where data that should have had data from AOnce all row combinations have been checked, take whatever is in the working set, and filter it using the where_conditionsApply any GROUP BY, ORDER BY, HAVING clauses as neededGet your result set, you’re doneThis still sounds ridiculously hard!Words are hard.
It’s easier with animations, so I toiled away with Excel and Photoshop to make some for you guys.
If you want to borrow them, feel free.
Warning: This is not how the SQL engine actually works!I’m glossing over a ton of things here, indexes, quirks of NoSQL (like if you’re on Hive and can only equi-join), the many optimizations that happen under the hood to not waste memory and time.
It’s fine, a beginner shouldn’t care about this stuff, the end results match.
INNER JOINSELECT A.
City_name, Godzilla_attacksfrom AJOIN B on A.
City_name = B.
City_nameThis one is simple — iterates through both tables, when the names match up, insert into working set.
A simple INNER JOINLEFT JOINSELECT A.
City_name, Godzilla_attacks from A LEFT JOIN B on A.
City_name = B.
City_nameAlso fairly simple, iterates through all the rows, finds out that Shanghai at the end has no joinable entry in B, adds it in with a NULL.
You can then filter the NULLs in the working set all you want later in the WHERE clause.
A simple LEFT JOININSANE JOINSELECT A.
City_name, Godzilla_attacks from A LEFT JOIN B on (A.
Country = ‘USA’ and B.
Godzilla_attacks = 2) OR B.
Godzilla_attacks = 13This query literally makes no sense.
If you can visualize the behavior of this thing in your head, you’ve pretty much mastered the concept of JOINs.
It takes total abuse of the fact that a JOIN is valid so long as the ON condition evaluates true.
You can abuse this in extremely powerful ways, using things like case statements, arithmetic, and other complex logic.
This is why self joins are totally natural and how you can build histograms easily.
The LEFT JOIN doesn’t even matter because all rows in A can find something to join against.
The engine doesn’t care, it only sees logic symbols, if true then join.
A “Go home you’re drunk” JOINIn ClosingHopefully, someone, somewhere, has been rescued from the venn diagrams with this.
If just one person was saved and has learned how to JOIN better, I’m happy.
Go forth, and write some crazy ON clauses.
ThanksThis post was inspired because a certain someone reminded me how much I hated this:CodeExample DDL and the “crazy” query in DB-fiddle.
create table A (City_name varchar, Country varchar);insert into A (City_name, Country) values(‘Tokyo’,’Japan’),(‘New York’,’USA’),(‘Fukuoka’,’Japan’),(‘Shanghai’,’China’);create table B (City_name varchar, Godzilla_attacks int);insert into B (City_name, Godzilla_attacks)values(‘Fukuoka’,3),(‘Nagoya’,2),(‘New York’,3),(‘Tokai’,3),(‘Tokyo’,13),(‘Yokohama’,2);.. More details