Bring Dynamo to the Data Science Party

Ad Hoc Querying of Dynamo Tables via Athena in a Jupyter Notebook

Alex Quintero · Mar 29

Dynamo is a great datastore for many applications, but one major thing it lacks is the ability to query across your entire table efficiently.
This article is going to show one way this can be accomplished.
If you are already familiar with Dynamo and are here just to learn about how to perform efficient ad hoc queries you can skip to the “Making the Dynamo table queryable” section.
Why We Chose Dynamo

Before we get deep into the meat and potatoes of this post, I’d like to take a minute to talk about Dynamo generally.
Dynamo has actually become the default datastore for our microservices, or as I like to call them, our right-sized services at Imagine Learning.
There are many reasons for this, but here are the primary reasons we have chosen Dynamo.
Incredibly Fast: Not only does Dynamo provide single-digit millisecond response times, it actually performs better while under load.
During our peak usage hours, we actually see Dynamo response times go down, which is impressive to say the least.
You also have the option to use DAX if you’re into microsecond response times.
Easy Scalability: Because Dynamo is built with scaling in mind, all you have to do is configure the table for on-demand capacity or set up auto scaling for your provisioned capacity.
You don’t even have to really think about shards or consider the fact that everything is spread out across multiple nodes.
You do need to choose a hash key whose values are evenly distributed, but generally it just works.
Cost: The crazy thing is that Dynamo is very cheap in comparison to relational databases.
Hosted databases on RDS for example are at least 5x as expensive for similar capacity.
Part of this is because you don’t have to think about the hardware at all and pay for machines that are sitting there doing nothing during off hours (insert serverless buzzword).
As with every technology it comes with trade-offs.
DynamoDB is no exception.
Here are some things that aren’t so great about Dynamo.
Up front knowledge of access patterns: You have to understand during the design phase how the table will be accessed to ensure you can query the table efficiently.
Limited transaction support: Transactions are available, however, there are some pretty significant limits with this.
Refer to this article from AWS for more details.
Queryability: As with other single table designs, you can only efficiently query based on the hash (partition) key in combination with the sort key.
That does limit you somewhat but local secondary indices and global secondary indices can help you here.
Of course, you could just always do table scans… but I’ll assume you know better.
Don’t worry there are options here, more on this later.
Learning curve: As engineers, we are very much used to thinking in terms of relational database design.
We are accustomed to normalization and optimizing for storage efficiency.
With Dynamo you have to shift your thinking: denormalization, and even intermixing data to reflect relationships, is sometimes the best approach.
It’s a mind bend and might be a bit uncomfortable at first.
In my opinion the benefits far outweigh the downsides.
If you still aren’t convinced, or even if you want to know more specifics of design patterns you can use with Dynamo to accommodate different data types, I highly recommend the following video.
So there you have it.
Dynamo is fast, scales and is cheap.
Seems like a good tool to have in your box.
Making the Dynamo Table Queryable

Alright, with that out of the way, let’s talk about ad hoc querying, or really just querying in general.
I mentioned that table scans are generally a bad idea and that this gap exists in the Dynamo service by itself, but let’s talk about how we can work around this limitation.
The basic idea is that you dump your Dynamo table into an S3 bucket at a time when it’s convenient.
Then you make that table available for querying via Amazon Athena.
The amazing thing about Athena is that, again, you only pay for what you use when the queries are performed.
That’s it, serverless at its finest.
No super expensive data warehouses lying around.
Did I mention that you can now query your data using mostly standard SQL? Athena is based on Presto, which gives you not only standard SQL but also a bunch of extra functionality.
One example of that extra functionality is making it possible to deal with JSON inside of your table in a pretty simple fashion.
Naturally, it isn’t quite that simple; there are some details to work through here.
For example, the data needs to be converted to Parquet format for efficient querying.
A Glue Crawler needs to run to automatically detect the schema of the data and then make it available to Athena.
Full details on how to do this can be found in this outstanding article by Goodreads.
They even have CloudFormation templates you can utilize to get the entire infrastructure up in minutes! Full credit goes to them for helping the rest of us along.
Now that your data is accessible via Athena there are a lot of things you can do with it.
You can integrate it into SageMaker or something similar to build a machine learning model.
You could create a series of ETL jobs to dump the data somewhere else where you could utilize your BI tool of choice (Power BI, Tableau, etc).
If you just need to answer one question then you certainly can just use the Athena interface inside the AWS console.
Another option, which I will spend the rest of this article outlining, is to integrate your new-found query powers into a Jupyter notebook, with some simple visualizations to boot.
Jupyter Notebook with PyAthena

If you’ve done anything related to machine learning or data science lately, you likely have some familiarity with Jupyter notebooks.
This is a great platform for building some basic queries and performing some simple analyses.
Granted, this is likely easy for you as the developer to consume, but it may not be the right platform if you are trying to make the data available to an entire company.
Regardless, let’s get to it.
The first thing to do is get a Docker container running with the right tools.
For this example we’ll take the SciPy based Jupyter container and simply add PyAthena.
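A minimal Dockerfile for this might look like the following sketch (the image tag is an assumption; check Docker Hub for the current one):

```dockerfile
# Start from the Jupyter project's SciPy notebook image
FROM jupyter/scipy-notebook:latest

# Add the PyAthena driver so notebooks can query Athena
RUN pip install --no-cache-dir pyathena
```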
Now we can build our Docker container with docker build -t scipy-athena:dev .
Feel free to give it whatever tag you’d like.
Now that we have a container we can use a docker-compose file to configure the environment variables and such that we’ll need.
We want to have environment variables for our AWS credentials so they aren’t saved into our notebooks.
You’ll definitely want to update the path on line 11 to reflect a location that is available locally on your machine.
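Here is a sketch of what such a docker-compose file could look like; the volume path is a placeholder you would replace, and the work directory reflects the Jupyter base image’s default user:

```yaml
version: "3"
services:
  notebook:
    image: scipy-athena:dev
    ports:
      - "8888:8888"
    # Values are pulled from the shell environment (or a .env file)
    # so credentials never land in the compose file or the notebooks
    environment:
      - AWS_ACCESS_KEY_ID
      - AWS_SECRET_ACCESS_KEY
      - AWS_DEFAULT_REGION
    volumes:
      # Replace with a local path where your notebooks should live
      - /path/to/your/notebooks:/home/jovyan/work
```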
Docker-compose does make it easy to define the environment variables by simply creating a .env file in the same directory as the docker-compose file.
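Such a .env file might look something like this, with placeholder values rather than real credentials:

```
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...
AWS_DEFAULT_REGION=us-east-1
```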
Alright.
Now that you’ve updated your .env file with the relevant details, we can get into the notebook.
Start the container with the docker-compose up command.
Look in the console output for a URL, copy it, and open it in a browser, making any edits needed (for example, swapping the container hostname for localhost).
Create a new notebook and paste the following as your first step.
That will import some required dependencies and also create our Athena connection.
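That first cell might look something like the sketch below. The staging-directory environment variable name is my own convention, not something PyAthena requires; Athena needs an S3 location it can write query results to.

```python
import os

import pandas as pd


def athena_connection():
    """Create a PyAthena connection from environment variables."""
    # Imported here so the function is self-contained; in a notebook
    # you would normally import at the top of the cell
    from pyathena import connect

    # ATHENA_STAGING_DIR is an assumed variable name, e.g.
    # s3://my-bucket/athena-results/ -- Athena writes query output there
    return connect(
        s3_staging_dir=os.environ["ATHENA_STAGING_DIR"],
        region_name=os.environ.get("AWS_DEFAULT_REGION", "us-east-1"),
    )
```

From there, `conn = athena_connection()` gives you a DB-API connection you can hand to `pd.read_sql`.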
From there you can now perform queries inside the notebook and use Pandas and Seaborn or your other favorite data science tools to analyze your data.
One of the beauties of this approach is that you can do this at any time knowing that you are having absolutely zero impact on your production database because this is a copy of all of your data.
Insert love for data lakes here.
Alright, I won’t go into exhaustive detail on how to query Athena as a quick Google search can give you better advice than I can.
Keep in mind though that the Presto documentation is very helpful along with the AWS Athena documentation.
Here is an example of a simple query.
This particular service allows an educator to create a list of activities that can be assigned to a student.
Here you can see an Athena query (looks sort of like familiar SQL) being performed and put directly into a Pandas dataframe.
The frame is simply printed out with total_length and then Seaborn is utilized to show a basic visualization of the table.
Pretty easy, right? You’ll notice in that query a few PrestoDB functions like json_array_length and json_extract.
The actual row in Dynamo contains some full JSON objects that are serialized as strings.
Those functions make it possible to get into the data even though it’s placed inside the column as JSON.
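As a sketch, a query along those lines might look like this. The table and column names are invented for illustration; json_array_length is a real Presto function and works directly on a JSON-encoded string column:

```python
# Hypothetical table "playlists" whose "activities" column holds a
# JSON array serialized as a string; count the activities per playlist
query = """
SELECT educatorid,
       json_array_length(activities) AS total_length
FROM   playlists
ORDER  BY total_length DESC
"""

# With a live Athena connection (see the setup cell):
#   df = pd.read_sql(query, conn)
#   df.head()
```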
One more example that is a little bit more complex.
Again, you don’t have to understand everything going on to see the power of using this approach.
This is going to look at how much time is being spent inside of these educator created playlists.
One neat trick used here is to extract elements in the JSON payloads into their own rows.
That is the CROSS JOIN UNNEST magic.
You can see a more detailed tutorial on that in this Medium article.
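The unnesting trick can be sketched like this; once again the table and column names are hypothetical:

```python
# Each playlist row stores a JSON array of usage events as a string.
# json_parse turns the string into a JSON value, the CAST makes it an
# array, and CROSS JOIN UNNEST fans it out into one row per element.
query = """
SELECT p.educatorid,
       event
FROM   playlists p
CROSS JOIN UNNEST(
    CAST(json_parse(p.events) AS ARRAY(JSON))
) AS t (event)
"""
```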
You’ll also notice that Pandas offers easy bucketing of data using the resample method to group all of the data by months.
Last part here is to show the data visually with Seaborn.
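The monthly bucketing can be sketched with synthetic data standing in for the Athena results; the column names here are stand-ins, not the service’s real schema:

```python
import pandas as pd

# Synthetic stand-in for the query results: one row per playlist event,
# with when it happened and how many seconds were spent in the playlist
events = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2019-01-15", "2019-02-03", "2019-02-20", "2019-03-10"]
        ),
        "seconds_spent": [120, 300, 180, 60],
    }
).set_index("timestamp")

# resample("M") buckets the rows by calendar month; sum() totals each bucket
monthly = events["seconds_spent"].resample("M").sum()
print(monthly)

# A basic Seaborn visualization of the monthly totals could then be:
#   import seaborn as sns
#   sns.lineplot(data=monthly.reset_index(),
#                x="timestamp", y="seconds_spent")
```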
Notice that there was a significant dip in the month of March in playlist usage.
There are multiple explanations for why this might be but the insight was gained by taking a close look at the data provided by the service backing this feature.
Conclusion

If you made it this far, congratulations, you must be a completionist.
Performing this type of data analysis is extremely useful and can aid greatly in making better, data-informed decisions about your services.
You can reap all of the benefits of Dynamo as mentioned above and have your data analytics too.
Sounds a lot like having your cake and eating it too.
Photo by Will Echols on Unsplash

But don’t take my word for it, the cake might be a lie.
Go try it out for yourself!