Data science productionization: portability
Portability decreases the time it takes to get value from your code by decreasing the amount of code you have to rewrite when your goals change.
Schaun Wheeler
Mar 21
This is the second part of a five-part series on data science productionization.
I’ll update the following list with links as the posts become available:
What does it mean to “productionize” data science?
Portability
Maintenance
Scale
Trust
The first step to productionizing data science is to make it portable.
To explain what I mean, let’s look at a simple example of code portability.
The first version of the code performs a simple task that is commonly found in text analysis: take a word, remove all white space on the ends, and lowercase all the characters; then examine each character of that word in order; if the character is in a list of “stops”, omit it; at the end, join all the remaining characters together so they are again one word.
The example uses a specific word — ‘abracadbra’ — and you can see that the result contains only the characters ‘rr’, as all of the other letters were in our list of stops.
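The original listing did not survive in this copy of the post. A minimal sketch consistent with the description, assuming the variable names and a stop list of 'a', 'b', 'c' and 'd' (neither is confirmed by the text), might look like this:

```python
# Hypothetical reconstruction: variable names and the exact stop list are assumptions.
word = 'abracadbra'
stopchars = ['a', 'b', 'c', 'd']

# Remove white space on the ends and lowercase all the characters.
word = word.strip().lower()

# Examine each character in order and omit the ones in the list of stops.
kept = []
for character in word:
    if character not in stopchars:
        kept.append(character)

# Join the remaining characters so they are again one word.
result = ''.join(kept)
print(result)  # rr
```

Everything here is hard-coded: to process a different word or a different list of stops, the script itself has to be edited.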
Now consider the same code with a few changes.
The new code takes the same input and produces the same output.
But there are a few ways this code is preferable to the previous version:
It upper-cases the variable `STOPCHARS`, following the Python convention for flagging constants. That tells someone reading this code that if they want to use that part of the code, they’ll need to copy and paste it, as it is not derived from any other source.
It is a visual signal telling a programmer how to use that part of the code in a different context.
The entire workflow is wrapped in a function.
This function can take an arbitrary word and an arbitrary list of stops.
If I wanted to change those things using the first code example, I would need to go up and erase the word definition I had and write in a new definition.
With the new code, I can just call the function with new inputs — I don’t have to erase any previous work; I can just add in new work.
I introduced a default value for the stops in the function definition.
If no stops are fed into the function, it will remove white space, lowercase, and return the word without needing to spend any time looking at each character individually.
So it can handle situations where I want to normalize a word but don’t have any stops to remove: the one function handles both use cases automatically.
In short, the revised code can be used for any number of projects — even projects I never anticipated when I wrote the function.
The original code can be used for the project for which I wrote it, and must be edited if I want to use it anywhere else.
So the second code example is the more portable of the two.
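The revised listing is also missing from this copy. A minimal sketch that reflects the three points above, assuming the function name `normalize_word` and the contents of the stop list (only `STOPCHARS` is named in the text), could be:

```python
# Upper-case name flags a constant that is defined here, not derived elsewhere.
STOPCHARS = ['a', 'b', 'c', 'd']  # assumed contents


def normalize_word(word, stops=None):
    """Strip and lowercase a word, then drop any characters found in `stops`.

    `normalize_word` is a hypothetical name chosen for this sketch.
    """
    word = word.strip().lower()
    if not stops:
        # No stops supplied: skip the character-by-character pass entirely.
        return word
    return ''.join(character for character in word if character not in stops)


print(normalize_word('abracadbra', stops=STOPCHARS))  # rr
print(normalize_word('  Abracadbra '))  # abracadbra
```

New words and new stop lists are handled by new calls, so nothing already written has to be erased.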
Types of portability
We can also make whole collections of code portable.
There are several ways to do this:
Packaging.
Packaging is the process of taking a lot of code — multiple files and directories — and bundling them together for easy installation on other computers.
Packaging systems for Python, for example, range from PyPI, which is widely used and relatively minimalist, to conda, which is growing in popularity because it makes installation easier and handles dependencies more robustly.
Dependency management is one of the major reasons for packaging software: modular software is often built on top of other modular software, which means that when you want to use one package you need to have all of its dependencies installed.
Keeping track of those dependencies through packaging makes code portable across users.
Also, code can perform quite differently depending on the dependencies with which it has been packaged.
For example, the graph below shows the difference in the number of images that can be processed per second using TensorFlow:
https://www. com/tensorflow-in-anaconda/
The conda packages perform 4–8 times faster than the PyPI packages.
That is due largely to a difference in packaged dependencies.
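As a small illustration of how packaging keeps track of dependencies, the requirements an installed package declares can be read back from its metadata. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `declared_dependencies` is hypothetical, and `pandas` is only an example of a package that may or may not be installed:

```python
from importlib import metadata


def declared_dependencies(package_name):
    """Return the requirement strings a package declares.

    Returns an empty list if the package declares none or is not installed.
    """
    try:
        requirements = metadata.requires(package_name)
    except metadata.PackageNotFoundError:
        return []
    return list(requirements or [])


# Prints one requirement string per line, or nothing if pandas is absent.
for requirement in declared_dependencies("pandas"):
    print(requirement)
```

An installer uses this same declared list to decide what else has to be installed alongside the package.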
Version control. Packaging makes for portability across users, while version control makes for portability across contributors.
The most popular version control system is git.
Git operates on a fork-modify-commit-merge model.
In the image below, the top line is the “master” branch of the code, and the remaining lines are other branches created for different purposes.
com/git/tutorials/comparing-workflows/gitflow-workflow
When you see code you want to use, you “fork” your own copy.
You can mess that copy up as much as you want — the original will still be right where it was.
When you make changes to the code on your local machine, you can commit them to your fork. Your local code will reflect any changes you make, but your fork will only reflect the changes you commit, and the original code will remain as it was.
If your fork gets to the point that the owners of the original code find it useful and stable and generally something to be desired, you can merge your fork with the master branch, after which point that master will reflect all of the changes you committed to your fork.
Version control allows the same basic collection of code to be worked on by many different people at the same time, enforces rules about what can become the “official” version of the code, and provides both an audit trail and a basis for disaster recovery, all by packaging code into separate branches and versions.
Containerization. Sometimes it’s not enough to package the code itself.
Containerization systems like Docker take care of dependency management by not only listing your code’s dependencies, but actually shipping those dependencies — system tools, other libraries, and settings — with the code.
It’s like packaging up your entire machine so it can be handed off to and unpacked by someone else.
Workflow automation. Sometimes, your code itself doesn’t need to be portable, but the processes that use that code do need to be.
Tools like Airflow allow you to specify the timing and order in which various scripts run, and specify where and how the outputs of that process should be stored.
This is actually a huge part of data science productionization and it doesn’t get enough attention: if you try to run every process from beginning to end each time you need it (or each time you need to update it), you’ll either have to run a huge number of processes in parallel, which can get expensive, or run certain processes less often, which can run counter to your business needs.
By chopping processes up into intermediate pieces and storing those mid-stream results, you save computation time by being able to point downstream processes to those products rather than to the scripts that produced them, and you get an audit trail for your data.
And, like the rest of your code, you can pick up those automated workflows and move them to different servers or point them at different databases without having to re-write your whole process.
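The idea of storing mid-stream results can be sketched without any workflow tool. The example below is plain Python, not Airflow; the helper `run_step`, the step names and the coordinates are all made up for illustration:

```python
import json
import tempfile
from pathlib import Path


def run_step(name, compute, store_dir):
    """Run `compute` once and store its result; later calls reuse the stored file."""
    path = Path(store_dir) / f"{name}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = compute()
    path.write_text(json.dumps(result))
    return result


store = tempfile.mkdtemp()  # stands in for wherever intermediate products live
calls = []  # records how many times the expensive step really runs


def clean_locations():
    calls.append("clean_locations")
    return [[40.7506, -73.9935], [40.7507, -73.9936]]  # made-up coordinates


cleaned = run_step("clean_locations", clean_locations, store)
cleaned_again = run_step("clean_locations", clean_locations, store)  # read from disk
count = run_step("count_locations", lambda: len(cleaned), store)
print(count, len(calls))  # 2 1
```

The downstream step points at the stored product rather than at the script that produced it, and the stored files double as a simple audit trail.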
Portability decreases the time it takes to get value from your code by decreasing the amount of code you have to rewrite when your goals change.
An example
Let me give one example from my own experience.
This is Madison Square Garden in New York City; Penn Station is underneath:
[Image: Madison Square Garden]
Here are two satellite images of the same location:
[Image: Satellite images of Madison Square Garden]
The first satellite image shows all of the mobile-device location signals we got for a single day over Penn Station.
The second image shows only those locations that were visited by over 100 unique mobile devices in that one day.
At first, it seems unrealistic for hundreds or even thousands of devices to show up in single 10-centimeter squares over just 24 hours, which is what the second image shows, but it might actually make some sense.
The middle of Madison Square Garden is over the main waiting area for the Long Island Rail Road.
People sit in dedicated places in waiting areas, they congregate around open electrical outlets, etc.
In short, I can’t say that those locations are obviously unrealistic.
Even those locations running down the avenue next to the square hotel east of the station fit with what I know about taxi waiting patterns on that street.
So I spent a few months developing several methods for differentiating artificial locations from simply busy locations.
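Those methods are not described in detail here, but the simple count behind the second satellite image (locations visited by over 100 unique devices in a day) can be sketched as follows. The input format of `(device_id, location)` pairs, the function name and the sample data are assumptions made for illustration:

```python
from collections import defaultdict


def busy_locations(signals, min_devices=100):
    """Return {location: unique device count} for locations seen by more than `min_devices` devices."""
    devices_by_location = defaultdict(set)
    for device_id, location in signals:
        devices_by_location[location].add(device_id)
    return {
        location: len(devices)
        for location, devices in devices_by_location.items()
        if len(devices) > min_devices
    }


# Made-up signals: 150 devices at one spot, one device reporting 500 times at another.
signals = [(f"device-{i}", (40.7506, -73.9935)) for i in range(150)]
signals += [("device-0", (40.7510, -73.9940))] * 500
print(busy_locations(signals))  # {(40.7506, -73.9935): 150}
```

Counting unique devices rather than raw signals keeps a single chatty phone from making a location look busy.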
In many cases, the locations the methods flagged weren’t surprising.
For example, we already knew that the geographic center of the United States is used as a dumping ground by companies when they don’t know where an IP address is located.
The centers of states and postal codes seem to be used similarly.
But other patterns were unexpected:
The grid pattern on the left is typical of high-density areas: centers of urban environments, airports, amusement parks, and things like that.
When tall buildings or multiple conflicting signals make it difficult to get a good GPS reading, many phones will revert to an older, less-precise method of location reporting, resulting in gridding.
The grid pattern on the right, on the other hand, isn’t actually a grid — it’s a hexagonal mesh used by the Weather Company to report forecasts.
Weather apps tend to report locations in this pattern.
We didn’t know any of that originally — we discovered the patterns, and then had to do some research to try to discover the data-generating processes.
We had an immediate business use for the pattern-recognition models we developed: we could flag problematic locations so we could ignore them.
That way, our location reporting would be based only on locations we could trust.
Because we made that process portable, we could easily ingest new sources of location data and apply our filters automatically — the process that worked for one dataset worked for another.
Because we version-controlled our process, we could make adjustments to our filters (add in new processes or modify old ones) without breaking any of our downstream processes.
And because we automated the workflow and dropped results from these filters into intermediary products that stored the full results, we could use those results for new purposes.
For example:
The above images show a bunch of alternative possible locations (ones not flagged by our filters) that could potentially be used in place of a single untrustworthy location.
Without going into too much detail about the specific implementation, you can see that the alternative locations aren’t randomly scattered across the world.
In fact, they’re scattered more or less across a single small town.
So we can replace a bad location with a bounding box that gives us a rough idea of where devices reporting that location actually are.
We even developed a cost-benefit algorithm that gave us the best balance of minimizing bounding box size while maximizing the number of trustworthy location signals used.
Often, we can reduce the bounding box size until it focuses on a single home rather than a single town, and still use most of the information we have about alternative locations.
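The actual cost-benefit algorithm is not described in this post. As a rough sketch of the kind of trade-off involved, the code below assumes a simple scoring rule (share of points kept minus share of the full bounding-box area used) and tries a few candidate shares of points nearest the median center. The function names, the scoring rule, the candidate shares and the sample points are all assumptions:

```python
from statistics import median


def bounding_box(points):
    """Smallest (south, west, north, east) box containing all (lat, lon) points."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return min(lats), min(lons), max(lats), max(lons)


def box_area(box):
    south, west, north, east = box
    return (north - south) * (east - west)


def best_bounding_box(points, keep_fractions=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Pick the box that best balances points kept against area used."""
    center_lat = median(lat for lat, _ in points)
    center_lon = median(lon for _, lon in points)
    # Points closest to the median center are kept first.
    ordered = sorted(
        points,
        key=lambda p: (p[0] - center_lat) ** 2 + (p[1] - center_lon) ** 2,
    )
    full_area = box_area(bounding_box(points)) or 1.0  # avoid dividing by zero
    best = None
    for fraction in keep_fractions:
        kept = ordered[: max(1, round(len(points) * fraction))]
        box = bounding_box(kept)
        score = len(kept) / len(points) - box_area(box) / full_area
        if best is None or score > best[0]:
            best = (score, box, len(kept))
    return best[1], best[2]


# Made-up points: a tight 3x3 cluster plus one far-away outlier.
points = [(40.0 + i * 0.001, -75.0 + j * 0.001) for i in range(3) for j in range(3)]
points.append((41.0, -74.0))
box, kept = best_bounding_box(points)
print(kept)  # 9
```

Dropping the single outlier shrinks the box dramatically while keeping nine of the ten signals, which is the balance the scoring rule rewards.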
We made that cost-benefit-balancing process portable and soon found an unanticipated use for it:
We’re interested in understanding when people visit specific locations. In particular, store visitation is important to our clients.
In the images above, the dots represent the centroids of individual stores within buildings.
The red dot is the store we were targeting in this instance.
We only get location signals when people view apps or websites where an ad is being served, and that doesn’t always happen within the actual store.
It may happen on the sidewalk outside, or in the parking lot.
Also, GPS signals are inherently noisy, so a phone in one store may report some of its signals in the store next door.
So we’ve worked on a way to tell the probability that any particular tile of ground indicates visitation to a particular store.
You can see all of the blue tiles of ground that we’ve attached to the store, along with the score for that association.
We needed to tell how many tiles we should attach to the store.
We used the same cost-benefit-balancing algorithm we used before (the results of which you can see below the store images), which saved weeks of work by reusing an established process in a new context.
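To illustrate how a cost-benefit balance could decide the number of tiles, here is a minimal sketch under an assumed scoring rule: the share of total association score retained minus the share of tiles used. It is not the algorithm we actually deployed; the function name and the tile scores are made up:

```python
def best_tile_count(scores):
    """Number of top-scoring tiles that maximizes score share kept minus tile share used."""
    ranked = sorted(scores, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0  # nothing to attach when no tile carries any score
    best_count, best_value, running = 0, float("-inf"), 0.0
    for count, score in enumerate(ranked, start=1):
        running += score
        value = running / total - count / len(ranked)
        if value > best_value:
            best_count, best_value = count, value
    return best_count


# Made-up association scores for six tiles around a store.
tile_scores = [0.9, 0.8, 0.7, 0.1, 0.05, 0.05]
print(best_tile_count(tile_scores))  # 3
```

The function only needs a list of scores, so it can be pointed at bounding boxes, tiles, or anything else that has a benefit and a cost.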
That, to my mind, is the main benefit of making data science processes portable: it drastically reduces the amount of time it takes to go from recognizing a problem to deploying a solution to the problem.
In a world where deadlines matter, the best tool is often the tool you’ve already built.
At the very least, the tool you’ve already built lets you get a solution out the door, which gives you the time to look for ways to improve that solution.