(Hint: If it couldn’t, this post, and the next, wouldn’t exist.)
In this post, we’ll discuss the strengths and weaknesses of Agile in the context of Data Science.
At risk of irritating agile practitioners, I may refer to Agile and Scrum interchangeably.
Nonetheless, do note that Scrum is an agile process framework, and there are others, such as Kanban.
In the next post, I’ll share some agile adjustments and practices that have proven to be useful — at least in the teams I’ve led.
Stay tuned!

Data science is part software engineering, part research and innovation, and fully about using data to create impact and value.
Aspects of data science that work well with agile tend to be more of an engineering nature, while those more closely related to research tend not to fit as well.
What aspects of agile work well with data science?

TL;DR:
- Planning and prioritisation at the start of each sprint
- Clearly defining tasks with deliverables and timelines
- Retrospectives and demos at the end of each sprint

Planning and Prioritisation at the start of each Sprint

In most of my past teams, sprints were usually one or two weeks long, and we’ve found this to be a good length.
Each sprint starts with a planning and prioritisation meeting which helps to align the data team with the needs of the organization.
Planning and prioritisation begin with stakeholder engagement.
Scrum provides for explicit prioritisation with stakeholders, and a framework for keeping a good overview of the tasks planned (and delivered), as well as their associated complexity and effort.
Having regular planning and prioritisation meetings gives (internal and external) stakeholders a better understanding of the costs associated with each data science effort, and of the overhead associated with frequently changing priorities and context switching.
This ensures alignment between the data team and its stakeholders, with stakeholders being conscious of their data effort budget, and the data team being aware of organisational needs and how it can contribute effectively.
Such planning and prioritisation helps the data team to practice one of the seven habits of highly effective people — “First things first”.
Clearly defining tasks with deliverables and timelines

One common issue faced by data science projects is a lack of focus, or getting derailed by investigations that go down the rabbit hole.
This is partially due to the innate curiosity that drives most data scientists, and partially due to the ill-defined nature of data science problems.
Defining tasks beforehand with clear timelines helps to mitigate this issue.
Having a clear, expected deliverable for each task aligns with one of the seven habits of highly effective people — “Begin with the end in mind”.
When approached with a new request, it helps to have the data science lead, or someone with more experience, help define the tasks and deliverables.
For example, if trying to understand why net promoter score (NPS, a measure of customer experience) went down, the expected deliverables could include analysis on various aspects of customer experience, such as:
- Delivery (e.g., timeliness, package arrival condition)
- Product (e.g., product ratings, reviews, price)
- Customer service (e.g., waiting times, number of touch points, customer service ratings)
- App metrics (e.g., spammy notifications, slow loading times, confusing UI)

This would help narrow down the causes for the drop in NPS.
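
One way to start narrowing down the causes is to compute NPS per segment and compare. Here is a minimal sketch in Python, assuming the standard NPS definition (percentage of promoters scoring 9 to 10, minus percentage of detractors scoring 0 to 6, on a 0 to 10 survey). The response records and the `delivery` field are hypothetical, made up purely for illustration.

```python
from collections import defaultdict


def nps(scores):
    """Net promoter score: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("need at least one score")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


def nps_by_segment(responses, key):
    """Group survey responses by a segment field and compute NPS per group."""
    groups = defaultdict(list)
    for response in responses:
        groups[response[key]].append(response["score"])
    return {segment: nps(scores) for segment, scores in groups.items()}


# Hypothetical survey responses; the field names are assumptions.
responses = [
    {"score": 10, "delivery": "on_time"},
    {"score": 9, "delivery": "on_time"},
    {"score": 7, "delivery": "on_time"},
    {"score": 3, "delivery": "late"},
    {"score": 6, "delivery": "late"},
    {"score": 9, "delivery": "late"},
]

overall = nps([r["score"] for r in responses])
by_delivery = nps_by_segment(responses, "delivery")
print(f"overall: {overall:.1f}")
for segment, value in by_delivery.items():
    print(f"{segment}: {value:.1f}")
```

A segment whose NPS sits far below the overall score (in this toy data, late deliveries) is a natural place to dig first.
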
Next, we can assess the impact of lower NPS on the business.
Do customers with lower NPS spend less (i.e., cart size, purchase frequency, absolute spend)?
Are they less active on the app or have they turned off notifications?
Are they at risk of attrition?

Defining these questions and hypotheses upfront provides milestones for data scientists as they conduct their analysis.
In addition, sharing these tasks with the stakeholders can elicit useful information and feedback based on their expertise.
The process is similar for building data products, where most projects have a similar flow:
- Data extraction: A minimum set of denormalised data across the organization’s data sources
- Data preparation: Consistent formatting, lower cased strings, nulls filled, outliers and seldom occurring values handled
- Feature engineering: Label/one-hot encoding, normalisation/scaling of continuous variables, various additional feature engineering
- Validation: Setting up the framework to validate (i.e., random sampling, time-based sampling); defining the right machine learning and AB testing metrics
- Machine learning: Assessing multiple models quickly, deciding the most suitable techniques, parameter tuning, more feature engineering, ensembling
- MVP and demonstration of results to stakeholders: Expected improvements to current metrics, expected effort and cost of production, roadmaps
- AB testing: Traffic splitting and sampling; sample size and power consideration; collection of AB testing results and data

The above examples only list some of the tasks required at a very high level.
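
To make the "sample size and power consideration" step a little more concrete, here is a rough sketch using the textbook normal-approximation formula for comparing two conversion rates. The baseline rate, target rate, significance level, and power are all hypothetical inputs, and this is an illustration rather than how any particular team sizes its tests.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per group for a two-sided test of two proportions."""
    if p_control == p_treatment:
        raise ValueError("rates must differ to detect an effect")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)  # critical value for power
    # Sum of the two binomial variances (unpooled approximation).
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical example: baseline conversion of 10%, hoping to detect a lift to 12%.
n_small_lift = sample_size_per_group(0.10, 0.12)
# A larger lift (10% to 15%) is easier to detect and needs fewer users.
n_large_lift = sample_size_per_group(0.10, 0.15)
print(n_small_lift, n_large_lift)
```

The smaller the lift you want to detect, the more traffic (and therefore time) the test needs, which is worth surfacing early when stakeholders ask how long things will take.
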
A natural question from stakeholders will be — “how long will it take?”.
Data scientists with a few years of experience can usually give a fairly accurate estimate of the effort required.
Nonetheless, this may vary based on the environment (e.g., infra, security, bureaucracy), data quality, and skills of the data scientist(s).
Take, for example, the development of a data product: should it take two years?
If it’ll improve organizational outcomes by 10x, perhaps.
If the improvement is 10%, maybe not, though it depends.
Thus, setting clear timelines before the start of the project, based on the estimated value of the project, helps set the right context for the data science team.
Depending on the timeline, whether it’s 6 weeks or 6 months to build an MVP, the data science team can allocate effort appropriately.
Retrospectives and Demos at the end of each sprint

Two rituals I especially enjoy are the retrospectives and demo sessions at the end of each sprint.
Their aim is to help the team learn from each other, celebrate our achievements, and get feedback on how to do better for the next sprint.
Considering that each takes about 30–60 minutes yet contributes so much to team growth, satisfaction, and well-being, they have a very high return on investment (of time).
At each retrospective, the team reflects on the past week’s sprint.
There are many ways to do this, but here’s an approach I’ve found to work.
Everyone fills up the whiteboard with points on what they found:
- Enjoyable: What aspects of the sprint and tasks did they enjoy? What were some achievements that we should celebrate?
- Frustrating: What aspects of work were frustrating? Were these challenges more of a technical nature? Or business nature? Or politics? What can we do to improve? What were the learnings from it?
- Puzzling: What puzzled you in the course of the week? Has anyone else on the team encountered it before? Are there any ideas on next steps?

If the retrospective is done weekly, it helps the team to grow and gain from each sprint.
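
Small weekly gains add up faster than intuition suggests. Here is a quick sketch of the compounding arithmetic, assuming (optimistically) that the team improves by a constant rate every single week. The function name and the rates tried are hypothetical.

```python
def compounded_gain(weekly_rate, weeks=52):
    """Overall improvement factor after compounding a constant weekly gain."""
    return (1 + weekly_rate) ** weeks


# Compare a few hypothetical weekly improvement rates over one year.
for rate in (0.01, 0.02, 0.05):
    print(f"{rate:.0%} per week -> {compounded_gain(rate):.1f}x after a year")
```
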
Given a 5% improvement from each weekly retrospective, after a year, the team will be 1.05 ^ 52 ≈ 12.6x better!

For the demo session, the team gets together to share significant milestones completed in the past sprint(s).
It is not necessary for everyone to demo every week — usually, demos are done after a significant chunk of work, or a specific milestone, which can take anywhere between 2–8 weeks.
At the demo, the team can learn from each other’s experiences, as well as provide feedback.
This greatly helps with team development, where a bunch of great people continuously develop and grow through learning and feedback from the people around them.
This also helps to increase the bus factor, and helps more junior members of the team to level up on the more advanced methods, or gain context on the organization and data.
In addition, demos promote accountability within the data science team, where people strive to demo something periodically.
Inviting the larger organization to the demo also promotes better understanding of data science efforts and ideas on how the data team can help with the organization’s goals.
What aspects of Agile make it hard to apply in Data Science?

TL;DR:
- Data Science efforts are more ill-defined and thus more difficult to estimate
- Scope and requirements may change very quickly
- Expectations that Data Science sprints should have deliverables like engineering sprints
- Being too good/disciplined at Scrum

Data Science efforts are more ill-defined and thus more difficult to estimate

Data science problems are ill-defined relative to engineering problems, which makes estimation harder.
For example, when a problem is provided, it is not always straightforward which data should be used.
Once the dataset is decided upon, how much effort is needed in data exploration, cleaning and preparation, feature engineering, assessing multiple models, and then achieving the target metric?