How to use Test Driven Development in a Data Science Workflow

Another thing Data Scientists and Machine Learning Engineers should learn from Software Developers

Timo Böhm, Mar 18

Every software developer knows about Test Driven Development (TDD for short), but too few people in data science and machine learning do.
This is surprising since TDD can add a lot of speed and quality to data science projects, too.
In this post, I walk you through the main ideas behind TDD and a code example that illustrates both the merit of TDD for data science and how to actually implement it in a Python-based project.

What is Test Driven Development?

TDD is an evolutionary approach to software development.
That is, it relies on incremental improvements, which goes along well with agile processes.
The easiest way to understand TDD is the “Red, Green, Refactor” system, based on the working model proposed by Kent Beck in 2003:

- Red: Write a new test and make sure that it fails. If it passes, the code base already covers the required functionality and does not need additional work.
- Green: Write code that passes the test. Most importantly, all previous tests have to pass, too! That is, the new code adds to the existing functionality.
- Refactor: Revise the code if necessary. For instance, make sure that the structure of the code base is on the correct level of abstraction. Do not add or change any functionality at this stage.

You can also think about these steps as finding answers to different questions:

- Red: How can I check whether my code delivers a specific functionality?
- Green: How can I write code that passes my check?
- Refactor: What do I have to change in the code base to improve it without affecting the functionality?

There are numerous advantages of this approach compared to other methods:

- Writing tests forces you to think about what scenarios users might create later on. A good test covers what the piece of software is supposed to deliver given a particular input or user behavior more generally.
- You need to write more code, but every piece of it is tested by design. The overall quality will therefore increase.
- Thinking in this paradigm promotes the development of clearly defined modules instead of overcomplex (and hard to maintain) code bases.

When (not) to use TDD in Data Science

Hopefully, I convinced you that TDD is a great idea for software development.
Given its approach, when can we apply these principles in data science for the most substantial effect?

TDD is probably not worth the effort in the following scenarios:

- You are exploring a data source, especially if you do it to get an idea of the potential and pitfalls of said source.
- You are building a simple and straightforward proof of concept. Your goal is to evaluate whether further efforts are promising or not.
- You are working with a complete and manageable data source.
- You are (and you will be) the only person who is working on a project. This assumption is stronger than it might appear at first glance but holds for ad-hoc analyses.

In contrast, TDD is great in these cases:

- Analytics pipeline
- Complicated proof of concept, i.e. different ways to solve a subproblem, clean data etc.
- Working with a subset of data, so you have to make sure that you capture problems when new issues come up without destroying working code.
- You are working in a team, yet you want to make sure that no one breaks the functioning code.

TDD Example: Tweet Preparation for NLP Tasks

For this example, I used pytest instead of unittest from the standard Python library.
If you are looking for an introduction to the latter, see the link at the bottom of this post.
To walk you through the TDD process, I chose a simple but still realistic example: preparing a list of tweets for further analysis.
More specifically, I want to base the analysis on clean and unique tweets.
This requires code for four subproblems:

1. Clean tweets from mentions of other accounts.
2. Filter out retweets.
3. Clean special characters from the tweets.
4. Filter out empty strings.

I deliberately ordered them in a way that is not ideal later on.
The reason for that: a number of stand-alone tests can cover each of these tasks.
The combination and order of these tasks is a separate step later on.
Since I knew from the beginning about these four tasks, I started by creating test cases for all of them.
That is, I came up with example tweets that exhibit each of these problems.
To have them easily available, I created @pytest.fixture functions for them.
Think about these functions as flexible placeholders for input values of your test cases.
All my code is part of a larger tweet_project module that includes a tweet_cleaning file with all the functions relevant to this example.
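The fixtures described above might look roughly like the following sketch. The fixture names and the example tweets are assumptions made for illustration; they are not the original test data.

```python
import pytest

# Hypothetical example tweets, one per subproblem.


@pytest.fixture
def tweet_with_mention():
    """A tweet that consists only of a mention of another account."""
    return "@some_account"


@pytest.fixture
def retweet():
    """A retweet, assumed to be recognizable by its leading 'RT'."""
    return "RT @some_account: Data science is fun"


@pytest.fixture
def tweet_with_special_characters():
    """A tweet that contains punctuation and a hashtag sign."""
    return "Data science is fun!!! #datascience"


@pytest.fixture
def empty_tweet():
    """An empty string that should be filtered out later."""
    return ""
```

A test function receives the return value of a fixture by listing the fixture name as one of its arguments.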
Let’s start the process with Red: I wrote a test for the clean_mentions function. This test fails, since all the function contains right now is a pass statement.
Therefore, it returns None instead of an empty string.
Now it is time for Green, that is, writing code that passes the test.
In my example, I used a regular expression to delete the “@” and everything afterward up to the next space. Now the test passes.
Is there anything to refactor right now? Nothing that directly impacts functionality.
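A regular expression of this kind could be sketched as follows. The exact pattern is an assumption: it removes the “@” and all following non-whitespace characters and leaves the surrounding spaces untouched.

```python
import re


def clean_mentions(tweet):
    """Delete every "@" together with the characters up to the next whitespace."""
    # Assumed pattern: "@" followed by any run of non-whitespace characters.
    return re.sub(r"@\S*", "", tweet)
```

With this pattern, clean_mentions("@some_account") returns an empty string, which is what the test asks for.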
I used the same method for the other three steps.
The tests for them expect the detect_ functions to return a boolean value that can be used as a filter later on.
I then wrote the respective functions to pass these tests. The four functions are entirely independent right now.
Each of them has a dedicated test that ensures they work as expected.
To conclude this example, let’s build a test that checks whether the whole pipeline produces the desired outcome.
That is, I want to add the functionality that takes a set of tweets, cleans what needs cleaning, filters out what is not useful and returns a set of tweets ready for further analysis.
The previous test cases do not sufficiently cover the potential scenarios.
That’s why I implemented a new pytest.fixture for that.
My test covers two essential characteristics of my desired output.
First, it should return only one tweet from the new test set.
Second, I need to make sure that the result is a list and not a set or string (or something entirely different) so that functions further down the road can rely on that.
The red stage is successful because the test fails.
In the green stage, I used existing (and therefore tested) functionalities and combined them so that the test passed. This piece of code works, but the need for refactoring is obvious.
It is part confusing and part ugly.
I encourage you to give it a shot for practice.
However, whatever you do from now on, all tests have to stay green.
You may not add additional functionality before starting the next cycle.
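For orientation, one possible Green-stage combination is sketched below. It is not the original code: the helper names, the function name prepare_tweets, the order of the steps, the whitespace stripping and the order-preserving removal of duplicates are all assumptions.

```python
import re


# Assumed implementations of the four independent, already tested functions.
def clean_mentions(tweet):
    return re.sub(r"@\S*", "", tweet)


def detect_retweet(tweet):
    return tweet.startswith("RT")


def clean_special_characters(tweet):
    return re.sub(r"[^A-Za-z0-9 ]", "", tweet)


def detect_empty_tweet(tweet):
    return tweet == ""


def prepare_tweets(tweets):
    """Return a list of clean, unique tweets."""
    prepared = []
    for tweet in tweets:
        # Filter retweets first, before any cleaning changes the text.
        if detect_retweet(tweet):
            continue
        cleaned = clean_special_characters(clean_mentions(tweet))
        # Assumption: drop whitespace left behind by removed mentions.
        cleaned = cleaned.strip()
        if detect_empty_tweet(cleaned):
            continue
        # Keep each tweet once, in order of first appearance.
        if cleaned not in prepared:
            prepared.append(cleaned)
    return prepared
```

Because the function builds and returns a list, both the type check and the length check from the pipeline test can be satisfied without converting from a set.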

Conclusion

I want to emphasize that TDD can only be as good as the tests written by the programmer or data scientist.
Therefore, it is crucial to think about which scenarios are likely to happen.
For instance, there was no test case with two mentions in the example above.
There are also some assumptions in there, for example that retweets in the data always start with RT.
However, these are not limitations of TDD but the result of human beings working on complex problems.
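Closing such a gap starts with another Red stage. As a sketch, a test for a tweet with two mentions might look like this; the test name, the example tweet and the regular expression in clean_mentions are assumptions.

```python
import re


def clean_mentions(tweet):
    # Assumed implementation: remove "@" and the following non-whitespace run.
    return re.sub(r"@\S*", "", tweet)


def test_clean_two_mentions():
    tweet = "@first_account @second_account"
    # Both mentions should disappear; only the separating space is left,
    # which is stripped before the comparison.
    assert clean_mentions(tweet).strip() == ""
```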
The advantages of TDD prevail:

- Each development step in itself is tested, and it is easy to understand what the test contained.
- Since every step builds on previous tests, it is way harder to break things unnoticed. This approach reduces the need for debugging significantly.
- There is a clear way to add functionality to an existing code base: extend an existing test or add a new one.
- Working in a TDD framework encourages explicit thinking and makes it very unlikely that you end up in a dead end or utterly confused.

I acknowledge that data science is not the same as software development.
However, thinking like a developer from time to time is a powerful thing to do.
Let me know in the comments or on Twitter if this post helped you or if you want to add something.
I’m also happy to connect on LinkedIn.
Thanks for reading!

Additional Material:

For an introduction to TDD with unittest in Python, I recommend this blog post by Dmitry Rastorguev: “A simple introduction to Test Driven Development with Python”.