Primary, Unique and Foreign Keys and Grouping When Working With Data Sets

Harrison Hardin · Jun 11

A key skill for analyzing any Data Set is knowing how to group subsets of your data effectively.
In the case of SQL databases, there are different kinds of Keys that you often need in order to uniquely identify rows within their respective tables.
Working on a recent project involving the European Soccer Database afforded me the opportunity to further examine the difference between Primary, Unique and Foreign Keys.
I was able to practice splitting up some of these groupings based off of Foreign Keys for further analysis.
What follows is a tutorial on all types of database Keys and some grouping techniques through the lens of my recent pairwork project.
First, I’ll touch on differences in the three Key types mentioned above.
When you need to group data for some kind of analysis, it is important to bear in mind a few things about the nature of database keys.
One of the first things you do in the process of cleaning data so that it’s malleable is to check for missing values and deal with those values in an appropriate manner.
If you are ever working with a Primary Key in a Data Set, it is important to realize that missing values (often appearing as Null or NaN, short for Not a Number, in the Data Sets) are not something you have to worry about in the Key itself.
That’s true for Foreign Keys as well, because Foreign Keys are just Primary Keys that are being referenced in a table other than the one they originate in.
Each table can only have one Primary Key, but there can be more than one of both Foreign and Unique Keys.
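To make these distinctions concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names here are invented for illustration, not taken from the Soccer Data Set):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces Foreign Keys only when asked

# 'id' is the Primary Key: exactly one per table, no NULLs, no duplicates.
# 'api_id' carries a Unique constraint: duplicate values are rejected.
conn.execute("""
    CREATE TABLE team (
        id     INTEGER PRIMARY KEY,
        api_id INTEGER UNIQUE,
        name   TEXT
    )
""")

# 'home_team_id' is a Foreign Key: it must match an existing team.id.
conn.execute("""
    CREATE TABLE match (
        id           INTEGER PRIMARY KEY,
        home_team_id INTEGER REFERENCES team(id)
    )
""")

conn.execute("INSERT INTO team (id, api_id, name) VALUES (1, 100, 'FC Example')")
conn.execute("INSERT INTO match (id, home_team_id) VALUES (1, 1)")  # fine

try:
    # Referencing a team that does not exist violates the Foreign Key
    conn.execute("INSERT INTO match (id, home_team_id) VALUES (2, 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The same IntegrityError fires if you try to insert a second team with api_id 100, which is the Unique constraint doing its job.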
Unique Keys are permitted to have one missing value, but no more than that.
The same restriction that applies to missing values applies to every other value in the column: the Key is called “unique” precisely because each value must be distinct.
All of this is to say that when you need a Key for your data grouping, the cleaning up of missing values is negligible.
You will have at most one missing piece of data to deal with, and that is only if the type of Key you are working with is a Unique Key.
In this project, one of the directions I took my analysis in involved using a Foreign Key to group games by both home and away team. I only ended up needing the home team’s games, because I was interested in comparing wins at home against all other game result types (away wins, losses at home, home draws and away draws).
The parallel grouping for away games is included in the snippet below just to further reinforce the concept, even though it didn’t figure into my analysis.
```python
games_home = df.groupby(df.home_team_api_id)
games_away = df.groupby(df.away_team_api_id)
```

The author of the Data Set, Hugo Mathien, sheds some light on the naming convention used here in this quote from the Kaggle page (which is linked to above): “When you have a look at the database, you will notice Foreign Keys for players and matches are the same as the original data sources. I have called those Foreign Keys ‘api_id’.”

API is an acronym for Application Programming Interface, and in this case refers to the resource he referenced in the first place to create that part of the Data Set.
Another example of this could be something like Dark Sky, a weather API that allows you to access historical weather data and forecasts.
I have used this in conjunction with a different Soccer Data Set in another team-based project to make predictions about the outcomes of games, but let’s get back to the example at hand.
Once the home games have been sliced by that Foreign Key, I got the individual teams with the following code:

```python
team_ids = list(games_home.groups.keys())
```

From there, I made sure that the data provided about goals was complete and that no missing information needed to be dealt with.
Then, it was on to calculating and storing percentages of wins at home.
All other game results were easily calculated as the complement of the home percentages metric by subtracting from 1:

```python
import numpy as np

# This returns 0 missing values for either column
df[['home_team_goal', 'away_team_goal']].isna().sum()

# So I go on to calculate a new column for victories at home
df['HomeWin'] = df.home_team_goal > df.away_team_goal

# Loop through to obtain home wins by team
for team in team_ids:
    x = games_home.get_group(team)
    home_win_pct = np.mean(x.HomeWin)
```

Ultimately, the conclusion I arrived at isn’t extremely shocking: playing at home offers a statistically significant advantage in your chances of winning.
The real value was in getting practice working with large SQL Data Sets, grouping by Foreign Keys, and using logic inside of Python functions to gain insight.
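As a rough sketch of how such a claim of statistical significance might be checked, here is a simple normal-approximation proportion test in pure Python (the 1/3 baseline and the match counts are invented for illustration, not taken from the project):

```python
import math

def home_win_z_test(wins, games, baseline=1/3):
    """Normal-approximation z-test: is the observed home-win rate
    above the baseline proportion? Returns (z, one-sided p-value)."""
    p_hat = wins / games
    se = math.sqrt(baseline * (1 - baseline) / games)
    z = (p_hat - baseline) / se
    # One-sided p-value from the standard normal survival function
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p

# Invented example: 12,000 of 25,000 matches won by the home side,
# against a naive 1/3 baseline (home win / away win / draw equally likely)
z, p = home_win_z_test(12000, 25000)
print(z, p < 0.01)
```

With a sample that large, even a modest edge over the baseline produces a very large z-score, which is why home advantage shows up so clearly.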
When not dealing with Keys (which in a worst-case scenario will have a single NaN), there are often lots of missing values that need to be handled in a manner best suited to the task at hand, whether that means simply dropping them, replacing them with the mean, or possibly the median.
It depends on what specific type of analysis you’re performing.
If your data contains several outliers of note, it is best to do median replacement of null values, because the median is robust to extreme values while the mean is not.
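A quick sketch of the difference in pandas (the toy numbers are invented for illustration):

```python
import pandas as pd
import numpy as np

# A small toy column with one missing value and one large outlier
goals = pd.Series([1, 2, 2, 3, np.nan, 40])

# The outlier drags the mean far above the typical value...
mean_filled = goals.fillna(goals.mean())      # fills the NaN with 9.6

# ...while the median stays representative of most rows
median_filled = goals.fillna(goals.median())  # fills the NaN with 2.0

print(mean_filled.iloc[4], median_filled.iloc[4])
```

The gap between 9.6 and 2.0 is exactly the kind of distortion median replacement is meant to avoid.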
This gets into a deeper discussion about levels of imputation.
Sometimes it is important to think about the reason the data is missing rather than jumping to replacing it.
Although outside the scope of this post, there is an even more nuanced approach to use called multiple imputation.
Screenshot from a dbdiagram.io schema crawler of some of the tables in the Data Set and how they connect

The last morsel of Data Set knowledge I’d like to touch on is how one goes about grouping by categorical data for the purpose of studying each category by statistical metrics.
Let’s take a look at some commented code from my project:

```python
# First, set up columns to see who won (same as above), lost or drew
df['AwayWin'] = df.home_team_goal < df.away_team_goal
df['Draw'] = df.home_team_goal == df.away_team_goal

# Then, create a new category: label each match with the name of the
# outcome column that is True for that row
df['GameOutcomes'] = df[['HomeWin', 'AwayWin', 'Draw']].idxmax(axis=1)

# Group Data Set by result of games
game_groups = df.groupby(df['GameOutcomes'])

# Discover insights about the rest of the Data Set, for example by
# looking at the mean:
game_groups.mean()
```

What this does in effect is allow you to see the mean value of the other variables in the Match table, grouped across three categories: a win at home, a loss at home (in other words, an away win), or a draw.
You could look at the changes in the mean of your preferred team’s victories by season, or perhaps investigate how their possession differs in wins versus draws (possession is a metric listed below the home_player columns in my screenshot above).
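A sketch of that kind of season-level question on a toy frame (the season labels and results are invented; the real Match table has a season column in the same spirit):

```python
import pandas as pd

# Toy match data in the spirit of the Match table (values invented)
toy = pd.DataFrame({
    'season':  ['2014/2015'] * 4 + ['2015/2016'] * 4,
    'HomeWin': [True, True, False, True, False, True, False, False],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the home-win rate per season
by_season = toy.groupby('season')['HomeWin'].mean()
print(by_season)
```

Swapping 'season' for any other categorical column gives the same per-category breakdown.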
It should be noted that in my analysis of home field advantage, I took this code a step further and added the home win percentage and its complement to a dictionary for each team, and transposed that dictionary in order to make it into its own Data Frame.
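That dictionary-to-DataFrame step can be sketched like this (the team ids and percentages are invented for illustration):

```python
import pandas as pd

# Home-win percentage and its complement, keyed by team api_id (invented)
results = {
    9987: {'home_win_pct': 0.62, 'other_pct': 0.38},
    9993: {'home_win_pct': 0.47, 'other_pct': 0.53},
}

# Building the frame directly puts teams in the columns; transposing
# with .T turns each team into a row, which is easier to sort and plot
win_df = pd.DataFrame(results).T
print(win_df)
```

Each row now sums to 1, since the second column is just the complement of the first.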
These can be seen in the GitHub repository linked in the first paragraph.
https://www.youtube.com/watch?v=u9nftlpCdXQ

And a special thank you to my partner on the project, Joe.