(iterate)STEP 1: Confirm Definitions and CalculationsFollowing the sage advice of my database elders, I created a catalog of all of our properties as well as a list of the queries that drive our existing reports and current OKR metrics.
This documentation became my source of truth during the build.
A simple shared document where teams could track progress in real-time.
I was lucky to join Landed after a period of rigorous property definition and data cleaning, so our starting point for this process was a place of clarity and consensus.
If you’re bootstrapping a model with less clarity, I encourage you to scale the project back until you arrive at the handful of properties and calculations everyone can agree upon (even if, at first, that’s justfirst_name and last_name).
As you build your system out and have some success with your early charting projects (“Look, a list of all our customer names!”), this will inspire others to stabilize their logic and clean up their data so they can jump in the game.
STEP 2: Create Base Tables with Mode DefinitionsWorking from our list of queries I then created base tables for each team.
These tables included only the properties needed to drive the reporting; a curated collection of VIP features.
I defined these base tables using Mode Definitions, which are select statements you can reference in other queries (up to 10 layers of nesting!).
These base tables were defined iteratively — as I wrote each query in Mode, I added the necessary properties I needed from our Redshift cluster.
These base tables ensure our users never query our raw data directly.
Even though I can fully ensure that our users don’t have the ability to mutate the raw data (by virtue of our read-only integration with Mode), we still want a clean record of the data types and calculations serving our analytics, which these base tables provide.
They are also where we define our aliases for each column, as our ETL from HubSpot to Redshift (which we use Stitch to coordinate) pulls in the internal names of our data properties, which differ from the field names our teams engage with.
While most of our team members will not be exploring our SQL queries directly, it’s always a good idea to norm on terminology.
STEP 3: Document tables in dbt .
yaml filesThis is the part of our journey where dbt comes in.
To be honest, it wasn’t entirely clear to me at the onset when we should switch from defining our model in Mode definitions to setting up our tests in dbt and building our materialized views.
This is partly because it’s quite easy to use Mode definitions in your Mode reports, and I noticed a bit of inertia when confronted with the hump of transferring things over to dbt.
When I remembered dbt’s test coverage tooling, I rallied!We arrived at an internal rule of thumb which was to start shifting things over from Mode once we’d covered 90% of the queries in our initial plan, which we hit about three weeks in.
This part of the process involved defining a .
yaml file for each base table that included pithy descriptions for all properties, as well as test definitions, of which there are four flavors in dbt: not_null, unique, relationships, and accepted_values.
Here’s an example:version: 2models: – name: customer_id description: Unique customer identifier columns: – name: uid tests: – not_null – unique – name: best_friend description: Customer's BFF tests: – relationships: to: ref('friends') field: uid – name: favorite_ice_cream_flavors description: Customer's favorite ice cream flavors tests: – not_null – accepted_values: values: ['cherry', 'chocolate']It’s worth noting that Mode supports persistent derived tables (pdt), which replicate the materialized views functionality provided by dbt.
I was drawn to dbt’s active open source community and testing methods, but if you’re already using Mode Analytics for your BI, you may want to look into pdt.
STEP 4: Schedule builds using dbt cloudThis is the stage we’re at now with Landed; scheduling our nightly dbt builds after our ETL integration.
Since we integrate data from our CRM every night, we schedule our dbt cloud job to materializes our base table views directly after.
This part is pretty straightforward, because by this point all the hard work of building and testing the model is done.
One thing to note, if you’re using Redshift you’ll probably want to use late-binding views so that you don’t hit any snags in your nightly build.
Late-binding views bind tables at run-time, so you can drop a build a view that binds to another table without throwing any errors.
I hit this when we set up our ETL with Stitch, and switched to late-binding and now everything runs fine.
STEP 5: Give back to the open source communityWhether picking off an issue for dbt or writing up a blog post about what you built, it’s always nice to round out a build cycle by giving some love back to the community.
Learning with others is one of the greatest joys of engineering, as is supporting projects that advance best practices in the field.
Find a ticket, make some friends through the repo, and submit a PR!Rinse and Repeat!That’s it! Now comes endless cycles of iteration :)Siobhán is the Data Engineering Lead at Landed.
Landed is on a mission to help essential professionals build financial security near the communities they serve, and we invest alongside educators when they’re ready to buy homes in expensive markets.
Sound awesome?.Join us!.. More details