The word “same” is sprinkled in each bullet point.
I smell an opportunity to apply DRY!

If you work in a corporate or academic setting like me, you probably do these things pretty often.
I’m going to show you how to wrap all of these tasks into a minimalist R package to save you time, which, as we’ve learned, is one of the keys to your success in the New Economy.

Tools

First, some groundwork.
I’ll assume if you work in R that you are using RStudio, which will be necessary to follow along.
I’m using R version 3.
1 on a Windows 10 machine (ahh, corporate America…).
Note that the package we are about to develop is minimalist, which is a way of saying that we’re gonna cut corners to make a minimum viable product.
We won’t get deep into documentation or dependencies, as the packages we’ll require in our new package are more than likely already on your local machine.

Create an empty package project

We’ll be creating a package for the consulting firm Ketchbrook Analytics, a boutique shop from Connecticut who know their way around a %>% better than anyone.
Open RStudio and create a project in a new directory.
Select R Package and give it a name.
I’ll call mine ketchR.
RStudio will now start a new session with an example “hello” function.
Looks like we’re ready to get down to business.

Custom functions

Let’s start by adding a function to our package.
A common task at Ketchbrook is mapping customer data with an outline for market area or footprint.
We can easily wrap that into a simple function.
Create a new R file and name it ketchR.
We’ll put all of our functions in here.
So what we’ve done is create a function that utilizes the tigris package to grab shapefiles for states in our footprint.
The function then unions those states into one contiguous polygon so we can easily overlay this using leaflet, ggmap, etc.
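
A minimal sketch of what such a function might look like. It assumes the tigris and sf packages are installed and that tigris returns sf objects (its default in recent versions). The state list (CA, OR, WA) and the `states` argument are assumptions based on the western seaboard footprint described here, and the union is done with `sf::st_union()`, which may differ from the original implementation.

```r
# Sketch only: requires the tigris and sf packages and an internet connection.
footprint_poly <- function(states = c("CA", "OR", "WA")) {
  # Download cartographic boundary polygons for all US states
  all_states <- tigris::states(cb = TRUE)

  # Keep only the states in the footprint (STUSPS is the postal abbreviation)
  footprint <- all_states[all_states$STUSPS %in% states, ]

  # Reproject to WGS84 so leaflet can draw it, then dissolve into one geometry
  footprint <- sf::st_transform(footprint, 4326)
  sf::st_union(footprint)
}
```

Calling `plot(footprint_poly())` should draw the merged outline. Taking the states as an argument keeps the function reusable if the footprint ever changes.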
Try your new function out and you get a nice little western seaboard.
There is no limit to the kinds of custom functions you can add to your package.
Machine learning algs, customer segmentation, whatever you want: throw it in a function and it’s easy to access from your package.

Datasets

Let’s stay on our geospatial bent.
Branch or store-level analysis is common in companies spread out over a large geographical region.
In our example, Ketchbrook’s client has eight branches from Tijuana to Seattle.
Instead of manually storing and importing a CSV or R data file each time we need to reference these locations, we can simply save the data set to our package.
In order to add a dataset to our package, we first need to pull it into our local environment, either by reading a CSV or grabbing it from somewhere else.
I simply read in a CSV from my local PC.
Now, we have to put this data in a very specific place, or our package won’t be able to find it.
Like when my wife hides the dishwasher so I’m reluctantly forced to place dirty dishes on the counter.
First, create a folder in your current directory called “data.”
Bonus points: use the terminal feature in RStudio to create the directory easily.
Now we need to save this branches data set into our new folder as an .RData file.

Now, we build

Let’s test this package out while there’s still a good chance we didn’t mess anything up.
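
Before building, the data step above can be sketched in base R. The `branches` rows and column names below are illustrative placeholders rather than the client’s real data, and `pkg_dir` points at a temporary folder so the snippet runs anywhere; set it to your package root (for example `"."`) when doing this for real.

```r
# Hypothetical stand-in for the CSV read; column names and rows are assumptions
branches <- data.frame(
  name = c("Seattle", "Tijuana"),
  lat  = c(47.61, 32.51),
  lon  = c(-122.33, -117.04),
  stringsAsFactors = FALSE
)

# Replace tempdir() with your package root, e.g. "."
pkg_dir  <- tempdir()
data_dir <- file.path(pkg_dir, "data")

# Create the data folder if it does not exist yet
dir.create(data_dir, showWarnings = FALSE)

# Save the object under its own name so the package can find it after a build
save(branches, file = file.path(data_dir, "branches.RData"))
```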
When we build the package, we are compiling it into the actual package as we know it.
In RStudio, this is super simple.
Navigate to the “Build” tab, and click “Install and Restart.”
If you’ve followed along, you shouldn’t see any errors, but if you do see errors, try updating your local packages.
Now, we should be able to call our package directly and use our branches dataset.
Cool, that works.
Now let’s quickly plot our branches with Leaflet to make sure footprint_poly() worked.
Niiiice.

Database connections

One of the most common tasks in data science is pulling data from databases.
Let’s say that Ketchbrook stores data in a SQL Server.
Instead of manually copying and pasting a connection script or relying on the RStudio session to cache the connection string, let’s just make a damn function.
Here, we’re building a function that lets us enter any query we want to bang against this SQL Server.
The function creates the connection, prompts us to enter the password each time (we don’t store passwords in code…) and closes the connection when it’s through.
Let’s take it a step further.
Many times you may pull a generic SELECT * query in order to leverage dplyr to do your real data munging.
In this case, it’s easier to just make a function that does just that.
Let’s make another function that pulls a SELECT * FROM Customers.
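
A rough sketch of both functions, assuming the DBI, odbc and rstudioapi packages are installed. The names `get_data` and `get_customers`, and the server, database and user values, are placeholders of mine, not Ketchbrook’s actual setup.

```r
# Sketch only: requires DBI, odbc and rstudioapi; connection details are placeholders.
get_data <- function(query) {
  con <- DBI::dbConnect(
    odbc::odbc(),
    Driver   = "SQL Server",
    Server   = "my-server-name",
    Database = "my-database",
    UID      = "my-user-name",
    # Prompt for the password on every call instead of storing it in code
    PWD      = rstudioapi::askForPassword("Database password")
  )
  # Close the connection when the function exits, even if the query fails
  on.exit(DBI::dbDisconnect(con), add = TRUE)

  DBI::dbGetQuery(con, query)
}

# Convenience wrapper for the generic pull you hand over to dplyr
get_customers <- function() {
  get_data("SELECT * FROM Customers")
}
```

Using `on.exit()` for the disconnect is a design choice: it guarantees cleanup even when the query throws an error.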
Ahh, this alone saved me quarters-of-hours each week once I started using it in my own practice.
Think hard about any piece of code that you copy and paste on a regular basis: that’s a candidate for your package’s stable of functions.

Branded ggplot visualizations

Ok, now we’re getting to the primo honey, the real time-savers, the analyst-impresser parts of our package.
We’re going to make it easy to produce consistent data visualizations which reflect a company’s image with custom colors and themes.
Although I personally believe the viridis palette is the best color scheme of all time, it doesn’t necessarily line up with Ketchbrook’s corporate color palette.
So let’s make our own set of functions to use Ketchbrook’s palette in a ‘lazy’ way.
(Big thanks to Simon Jackson’s great article.)

Get the colors

Let’s pull the colors directly from their website.
We can use the Chrome plugin Colorzilla to pull the colors we need.
Take those hex color codes and paste them into a code chunk.
This will give us a nice palette that has colors different enough for categorical data, and similar enough for continuous data.
We can even split this up into two separate sub-palettes for this very purpose.

Create the functions

I’m not going to go through these functions line by line; if you have questions, reach out to me at bradley.lindblad[at]gmail[dot]com or create an issue on the Github repo.
Let’s test it out on a quick plot.
Nice.
Having consistent colors in Ketchbrook’s visualizations will help build awareness for their brand, and faster recognition in the marketplace.
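
A minimal sketch of the palette setup described above, in base R. The hex codes are placeholders, not Ketchbrook’s actual colors, and the function names (`ketch_cols`, `ketch_pal`) and sub-palette groupings are assumptions for illustration.

```r
# Placeholder hex codes: swap in the colors you pulled from the website
ketch_colors <- c(
  navy  = "#1B365D",
  blue  = "#2E86C1",
  teal  = "#17A589",
  amber = "#F5B041",
  red   = "#CB4335"
)

# Return hex codes by name, or the whole set when no names are given
ketch_cols <- function(...) {
  cols <- c(...)
  if (is.null(cols)) {
    return(ketch_colors)
  }
  ketch_colors[cols]
}

# Sub-palettes: one for categorical data, one for continuous data
ketch_palettes <- list(
  main        = ketch_cols(),
  categorical = ketch_cols("navy", "amber", "red"),
  continuous  = ketch_cols("navy", "blue", "teal")
)

# Build a function that interpolates a palette to any number of colors
ketch_pal <- function(palette = "main", reverse = FALSE) {
  pal <- unname(ketch_palettes[[palette]])
  if (reverse) pal <- rev(pal)
  grDevices::colorRampPalette(pal)
}
```

With ggplot2 installed, these plug into existing scales, for example `ggplot2::scale_color_manual(values = unname(ketch_cols("navy", "amber", "red")))` for categories or `ggplot2::scale_color_gradientn(colours = ketch_pal("continuous")(256))` for continuous data.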

Markdown templates

Now that we’ve fetched the data and plotted the data much more quickly, the final step is to communicate the results of our analysis.
Again, we want to be able to do this quickly and consistently.
A custom markdown template is in order.
I found this part to be the hardest to get right, as everything needs to be in the right place within the file structure, so follow closely.
(Most of the credit here goes to an article by Chester Ismay.)

Create skeleton directory

First we create a nested directory that will hold our template .Rmd and .yaml files.
You should have a new folder in your directory called “ketchbrookTemplate.”

Create skeleton.Rmd

Next we create a new RMarkdown file.
This will give us a basic RMarkdown file, which we can modify to fit our needs.
First I’ll replace the top matter with a theme that I’ve found to work well for me; feel free to rip it off.
I like to follow an analysis template, so I combine this top matter with my basic EDA template.
Save this file in the skeleton folder and we’re done here.
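
The file layout for the template can also be scripted. The sketch below is an assumption-laden helper (the name `scaffold_template`, the skeleton contents and the `name`/`description` values are mine): it creates the nested folders, drops in a bare-bones skeleton.Rmd, and also writes the template.yaml covered in the next step.

```r
# Sketch only: base R helper that lays out an R Markdown template folder.
scaffold_template <- function(pkg_dir = ".", template = "ketchbrookTemplate") {
  template_dir <- file.path(pkg_dir, "inst", "rmarkdown", "templates", template)
  skeleton_dir <- file.path(template_dir, "skeleton")

  # Create the whole nested path in one call
  dir.create(skeleton_dir, recursive = TRUE, showWarnings = FALSE)

  # Placeholder skeleton; replace with your own top matter and outline
  writeLines(
    c(
      "---",
      "title: \"Untitled\"",
      "output: html_document",
      "---",
      "",
      "## Summary",
      ""
    ),
    file.path(skeleton_dir, "skeleton.Rmd")
  )

  # template.yaml tells RStudio how to list the template (values are placeholders)
  writeLines(
    c(
      "name: Ketchbrook Template",
      "description: Standard analysis report",
      "create_dir: false"
    ),
    file.path(template_dir, "template.yaml")
  )

  invisible(template_dir)
}
```

Run `scaffold_template()` from the package root, then edit the two files it writes before rebuilding.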

Create the yaml file

Next we need to create a yaml file.
Simply create a new text document called “template.yaml” in RStudio and save it in the ketchbrookTemplate folder, next to the skeleton folder.
Rebuild the package and open a new RMarkdown document, select “From Template” and you should see your new template available.
Sweet.
You can now knit to html and get pretty output.
If you run into problems, make sure your file structure matches this:

├───inst
│   └───rmarkdown
│       └───templates
│           └───ketchbrookTemplate
│               │   template.yaml
│               │
│               └───skeleton
│                       skeleton.Rmd

What’s next?

So we’ve essentially made a bomb package that will let you do everything just a little more quickly and a little better: pull data, reference common data, create data viz and communicate results.
From here, you can use the package locally, or push it to a remote Github repository to spread the code among your team.
The full code for this package is available at the Github repo set up for it.
Feel free to fork it and make it your own.
I’m not good at goodbyes so I’m just gonna go.
I’m available for data science consulting on a limited basis.
Reach me at bradley.lindblad[at]gmail[dot]com.

Thanks for reading!