Data science productionization: maintenance

Time spent on maintainability reduces time spent on actual maintenance.

Schaun Wheeler, Mar 22

This is the third part of a five-part series on data science productionization.

I’ll update the following list with links as the posts become available:

What does it mean to “productionize” data science?
Portability
Maintenance
Scale
Trust

In the last post, I used a simple word-normalizing function to illustrate a few principles of code portability.

Now let’s look at the same function, but this time prioritizing maintenance.

The first part doesn’t even include the function itself.

What I’ve set up here is logging infrastructure.

I’ve designated a file for recording errors (called `error.log`), and I’ve set up rules about what words to print in that file whenever there’s an error.

Often, logging is preferable to raising a full error because it allows the rest of the system to continue running even when things go wrong.

Downtime is often more damaging than errors or gaps.

Notice also that I’ve added comments (lines that start with a hash sign) to explain what each part of the logging setup does.

Then we get to the function definition (the line that starts with `def`), but again there’s no actual code at that point.

This part is called a “docstring”.

It explains the purpose of the function, the expected inputs, and the expected output.

Now we get to the function content.

First, we have an “assert” statement.

That statement is a test.

If the “word” input isn’t a string, none of the subsequent code will work, so this test checks to make sure the input is a valid type and throws an error if it isn’t.

After that, you can see the actual code of the function, but I’ve wrapped it all in a “try” statement.

This statement first tries to run all of the code as we had it originally written.

If there’s a SystemExit error or a KeyboardInterrupt error (both problems originating outside of the function), the function just tells me so.

For any other error, the function writes the error to the log file.

That way, I can monitor the log file, see any errors that occur, and get information that helps me to debug the error.
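
To make the walkthrough above concrete, here is a minimal sketch of a function built along those lines. It is an illustration under assumptions, not the original code: the logger name, the log format, and the normalization step itself (stripping whitespace and lowercasing) are stand-ins I chose for the example. Only the `error.log` file name and the overall structure (logging setup, docstring, assert, try/except) come from the description.

```python
import logging

# Create a logger that only records errors.
# The logger name is a hypothetical choice for this sketch.
logger = logging.getLogger("normalize_word")
logger.setLevel(logging.ERROR)

# Write records to error.log; delay=True opens the file only on first write.
handler = logging.FileHandler("error.log", delay=True)

# Record when the error happened, which logger reported it, and the message.
handler.setFormatter(
    logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
)
logger.addHandler(handler)


def normalize_word(word):
    """Return a normalized version of a single word.

    Parameters
    ----------
    word : str
        The word to normalize.

    Returns
    -------
    str or None
        The normalized word, or None if normalization failed
        (in which case the error is written to error.log).
    """
    # Test the input type before doing anything else.
    assert isinstance(word, str), "word must be a string"

    try:
        # Stand-in normalization (an assumption for this sketch).
        return word.strip().lower()
    except (SystemExit, KeyboardInterrupt):
        # Problems that originate outside the function: pass them along.
        raise
    except Exception:
        # Anything else goes to the log file, with the traceback attached,
        # so the rest of the system can keep running.
        logger.exception("Could not normalize %r", word)
        return None
```

Returning `None` after logging is one possible design choice; it keeps a pipeline running, but every caller then has to handle a missing value.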

At its most basic, maintainability is about logging, alerting, and testing.

Happily, these things are usually built-in components of a programming language.

Unhappily, they’re really boring to set up and are often allowed to fall by the wayside.

They take a few extra minutes or sometimes hours to set up, but once they are set up, they can save you hours or even days of work in identifying problems.
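
As one illustration of how little setup the built-in tools need, here is a small sketch using Python’s standard `unittest` module. The `normalize_word` function is a simplified stand-in I wrote for the example, not the function from the earlier post, and the test names are hypothetical.

```python
import unittest


def normalize_word(word):
    """Stand-in normalizer for this sketch: strip whitespace and lowercase."""
    assert isinstance(word, str), "word must be a string"
    return word.strip().lower()


class NormalizeWordTests(unittest.TestCase):
    def test_lowercases_and_strips(self):
        # A well-formed input comes back normalized.
        self.assertEqual(normalize_word("  Hello "), "hello")

    def test_rejects_non_strings(self):
        # A non-string input fails the assert at the top of the function.
        with self.assertRaises(AssertionError):
            normalize_word(42)
```

Saved in a file, tests like these can be run with `python -m unittest`, and they can run again every time the code changes.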

When we zoom out from specific code and think about the maintainability of entire systems of code, we’re no longer just talking about docstrings and logging infrastructure (although those things are important).

Documentation is the primary non-code way businesses ensure the maintainability of their data science capabilities.

It’s one of those other things that everyone says you should do but that few people actually do.

I think that’s because people often don’t have a specific idea of what value documentation offers.

There are different kinds of documentation, each serving different purposes.

Technical documentation

Technical documentation should allow a competent programmer, previously unfamiliar with the code, to write new code that will substantively reproduce the results of the original code.

I say “substantively” reproduce because an exact reproduction often depends on minute details of the implementation — details that can only be found in the code itself.

Standard technical documentation might be organized into the following kinds of sections:

Background.

This is the motivation behind the product — the general business problems it is trying to solve.

It might also include the goals (and non-goals — I’ll write more about those in a second) for the product.


Design.

This is the longest part of the documentation.

It sets out, step-by-step, the components of the product.

It identifies all of the data sources to be used as inputs, highlights any dependencies on other products, defines which models, algorithms, or procedures will be used to process the data, and flags any ways in which the configurations of those procedures differ from standard practices or package defaults.

Flow charts can be helpful in these sections.

Code snippets can often be helpful too.


Results.

This section defines specific metrics that measure the performance of the product.

This might be an error distribution or confusion matrix that shows the accuracy of the model; it might be a breakdown of results by customer, region, or some other category.

This section is essentially an argument that your product did what it was meant to do.

Open questions.

This might highlight weaknesses in the product or risks it presents, such as security flaws or use cases not catered to.

Unresolved design or implementation questions, as well as ways to improve the product that would be good to explore but aren’t urgent enough to tackle right away, should be listed here.

Non-technical documentation

Non-technical documentation should allow a non-technical user to understand where a product fits into the larger ecosystem of the company’s offerings, and to understand not only the benefits and limitations of the product but also, at a high level, why those benefits and limitations exist.

Non-technical documentation might be organized into the following kinds of sections:

Business context.

This should briefly describe why the business should expect to find value from the product.


Goals.

This is the list of specific outcomes the product should achieve.

These should all be things that can be explicitly evaluated — after the product has been live for some amount of time, you should be able to go back through each goal and ask whether it has been accomplished.


Non-goals.

This is a list of things that the product is not intended to accomplish.

This is an important aspect of the documentation — it defines the scope of the product.


Limitations.

This should be a radically honest list of all the things the product either doesn’t do, or doesn’t do as well as it might.

Stakeholders might soften the language of this list when communicating with, say, external clients, but there should always be a place where the product’s weaknesses are candidly discussed and not mathsplained.

Interpretation/usage guidelines.

This should give a non-technical user advice on how to use the outputs of the product responsibly.

This might include guidelines on interpreting metrics (for example, what constitutes a “high” score and what constitutes a “low” score?), and changes in settings that could be requested if the stakeholder is unhappy with the results.

Non-technical documentation is particularly important because you have to maintain user engagement with your product as much as you have to maintain the product itself.

A focus only on technical maintenance is like building a beautiful house and then never getting anyone to actually live in it.

A personal example

When I built a data science team for a large charter school network, the most pressing business need was to create summary evaluations of student performance using a lot of different types of data.

In each of our core subjects — English language arts, mathematics, and science — students were assessed every two to three weeks, and each assessment had between one and maybe 40 questions.

Assessment formats ranged from simple multiple choice exams to proctored reading examinations where the student read out loud and the teacher scored their fluency.

Every few weeks, a bunch of school administrators, content creators, and teachers would get together, review spreadsheets of scores, and fight about what the data meant.

Here’s what a spreadsheet for a single school, for a single assessment, looked like: each row is a student and each column is a question, with yellow cells marking cases where the student performed below expectations on a question.

Imagine dozens of these spreadsheets.

Now imagine trying to answer what seems on its face to be a simple question: which students most need our help?

When I arrived at my job with the charter school network, it took at least a week to answer that question.

Administrators would debate how relevant assessments from one month ago were.

Teachers would argue that questions 2 and 15 from the most recent assessment were written poorly and therefore didn’t give a good read on performance.

That was the challenge my team needed to tackle.

Our main data science product was an algorithm that reduced that process from one week and a committee to an automated result obtainable within a few seconds.

We did this by scoring each question in terms of how well it differentiated historical low-performers from historical high-performers.

So, when looking at an individual question on an individual assessment, if students who had historically failed most questions in that subject through the year got that question right, and students who had historically gotten most questions right in that subject got it wrong, that particular question was probably poorly constructed.

Poorly constructed questions should impact overall scores less because students shouldn’t be punished for teachers’ and administrators’ mistakes.

Once we had each question scored, we could average scores over months of work, weighting that average by the question quality scores.

We also incorporated an exponential decay that gave more weight to more recent data.
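
The sketch below shows one way the pieces described above could fit together. It is not our production algorithm. It makes several assumptions for illustration: question quality is measured as the difference in the share of correct answers between students above and below the median historical performance (similar in spirit to a classical item-discrimination index), negative differences are clipped to zero, and the 30-day half-life is an arbitrary default. All function and parameter names are hypothetical.

```python
from statistics import median


def question_quality(responses, history):
    """Score how well one question separates historical high and low performers.

    responses: {student: 1 if correct else 0} for this question
    history:   {student: share of past questions answered correctly}
    Returns a value between 0.0 (uninformative or misleading) and 1.0.
    """
    cutoff = median(history[s] for s in responses)
    high = [r for s, r in responses.items() if history[s] > cutoff]
    low = [r for s, r in responses.items() if history[s] <= cutoff]
    if not high or not low:
        return 0.0  # no two groups to compare
    gap = sum(high) / len(high) - sum(low) / len(low)
    # Assumption: a question that low performers answer better than
    # high performers gets no weight at all.
    return max(0.0, gap)


def decay_weight(age_days, half_life_days):
    """Exponential decay: the weight halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def weighted_score(results, half_life_days=30.0):
    """Average one student's results, weighted by question quality and recency.

    results: list of (correct, quality, age_days) tuples
    Returns None when every weight is zero (nothing informative to average).
    """
    weights = [q * decay_weight(age, half_life_days) for _, q, age in results]
    total = sum(weights)
    if total == 0:
        return None
    return sum(w * correct for w, (correct, _, _) in zip(weights, results)) / total
```

With a 30-day half-life, a correct answer from 30 days ago counts half as much as an answer from today on a question of the same quality, so a student with one of each (the older one correct, the newer one wrong) would score one third.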

All of that allowed us to create the data product that backed a set of visualizations in which each graph is one school and each dot is a single student.

Students run horizontally from lowest current performance to highest current performance, and vertically from greatest decrease in performance over the course of the year to greatest increase in performance.

The darkest dots are the students who are in the most trouble — their performance is bad and falling.

The deepest orange dots are the students whose performance is good and growing.

School network administrators could use summaries like these to quickly compare schools and see how they were performing relative to one another.

Individual school leaders could use this to quickly identify students who needed more help or needed to be challenged more.

The single biggest factor in the successful adoption of this product was the non-technical documentation that backed it.

In order to get school leaders to use this tool, we needed extensive explanations of how we calculated question quality scores.

We also needed to clarify in painstaking detail the situations where the algorithm was informative.

For example, the algorithm’s purpose was to help with resource allocation — educators only have so much time in the day, so we wanted them spending as much of that time on the students who needed it most.

The students who were the worst performers in the school could still be exceeding government-mandated performance standards.

The algorithm didn’t care about that — it only cared that those students were getting less educational benefit than their peers.

Over time, we had to build out that non-technical documentation to address new concerns about the purpose and scope of the original documentation.

At one point, several school leaders came to me with what they thought was a mistake in the algorithm.

They’d pulled out their spreadsheets and showed that some students who were scored very low on our English-language-arts performance index in fact had very high scores on a lot of individual ELA assessments.

When we dug into the issue, we found they did in fact have high scores on many assessments: all their weekly spelling tests.

They had low scores on their core literacy assessments.

The algorithm was able to recognize that being able to memorize words was a less dependable measure of performance than being able to meaningfully interpret a text.

The more we documented, the more widely the tool was adopted and, we hope, the more benefit we were able to pass on to the students.

The system required just as much technical documentation as non-technical documentation.

As demand for the tool increased, we had to change how we stored and retrieved the underlying data.

At one point, feedback from educators convinced us that we’d set the wrong half-life on our time-decay weights so we needed to go in and revise it.

We discovered in our second year of using the algorithm that it gave meaningless results at the beginning of the school year in less assessment-heavy subjects because there just wasn’t enough data.

All of this needed to be codified (in other words, made portable), but it also needed to be documented so our future selves could revise the system without breaking it.

Time spent on maintainability reduces time spent on actual maintenance.

If we don’t build with future maintenance in mind, then we become less able over time to build new things, because we have to spend all of our time keeping the old things running.

