Adopting a Hypothesis-Driven Modeling Workflow
Michael Brady
Jun 3
Many of the algorithms used as the foundation of modern predictive modeling use an iterative process to arrive at an optimized solution.
For some model types, like boosting, each iteration is informed by the results of the prior iteration.
As a model builder, I seek to implement a workflow that mirrors that of the predictive algorithms at the core of my models: constant iteration informed by hypotheses.
Constant iteration allows for complexity to slowly be layered on top of a basic baseline model (e.g., additional features, scaling, imputing, tuning).
An iteration assesses one targeted set of changes to the model.
This process allows the model-builder to assess the impact of each set of changes on the overall model (retaining only those changes with positive impact).
Each iteration seeks to assess the validity of a hypothesis held by the model-builder.
The core job of the model-builder is to form and test hypotheses.
Sources of hypotheses include prior experience (what has worked in the past?), domain expertise, exploratory data analysis, and intuition.
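The loop described above (one targeted change per iteration, retained only when it has a positive impact) might be sketched roughly as follows. This is a minimal illustration, not the code behind the original model: `run_iterations`, `toy_evaluate`, and the configuration keys are hypothetical names, and the toy scoring function merely stands in for whatever validation metric you actually use.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, bool]


def run_iterations(
    baseline: Config,
    changes: List[Tuple[str, Config]],
    evaluate: Callable[[Config], float],
) -> Tuple[Config, List[Tuple[str, float, bool]]]:
    """Test one targeted change per iteration; keep it only if the score improves."""
    current = dict(baseline)
    best_score = evaluate(current)
    history = []
    for hypothesis, change in changes:
        candidate = {**current, **change}  # layer the change on top of what was kept
        score = evaluate(candidate)
        kept = score > best_score
        if kept:
            current, best_score = candidate, score
        history.append((hypothesis, score, kept))
    return current, history


def toy_evaluate(config: Config) -> float:
    # Hypothetical scores for illustration only; in practice this would
    # train the model and return a validation metric such as accuracy.
    score = 0.50
    if config.get("impute"):
        score += 0.05
    if config.get("scale"):
        score -= 0.01
    if config.get("extra_features"):
        score += 0.10
    return round(score, 2)


changes = [
    ("Imputing missing values helps", {"impute": True}),
    ("Scaling helps", {"scale": True}),
    ("Extra features help", {"extra_features": True}),
]
final_config, history = run_iterations({}, changes, toy_evaluate)
for hypothesis, score, kept in history:
    print(f"{hypothesis}: score={score}, kept={kept}")
```

Because each candidate differs from the current model by a single set of changes, any movement in the score can be attributed to that change alone.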
When building predictive models, a frequent trap I fall into is adding complexity too quickly, without following the iterative building process described above.
Adding complexity without iteration means that I lose a granular understanding of which of my hypotheses are helpful, which are irrelevant (and thus possibly adding noise), and which are detrimental.
This often leads to a number of problems:
Slower Iteration: The model is computationally expensive to run, reducing my ability to iterate and work quickly.
Low Understanding: I can’t explain what gives my model predictive power, reducing my ability to optimize. Moreover, the model is frequently unnecessarily complex (e.g., a simpler model would achieve similar performance).
Fewer Ideas: I have a reduced ability to generate good hypotheses, as I am not getting as much feedback from the model (which often serves as a source of good ideas).
To demonstrate the power of a workflow that adheres to the ideal of hypothesis-driven iteration, the visualization below details the iterative process I used to build a simple model designed to predict whether water pumps located in Tanzania are functional, functional but in need of repair, or non-functional.
This dataset is available as part of an ongoing modeling competition hosted by DrivenData.
As the graphic below details, over a series of 10 relatively simple iterations my model improved from the naive majority-class baseline of 54% accuracy to ~78% accuracy.
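For readers unfamiliar with the term, a majority-class baseline is the accuracy obtained by always predicting the most common label. A minimal sketch follows; `majority_class_baseline` is a hypothetical helper name, and the labels are toy values for illustration, not the competition data.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def majority_class_baseline(labels: Sequence[Hashable]) -> Tuple[Hashable, float]:
    """Return the most common label and the accuracy of always predicting it."""
    if not labels:
        raise ValueError("labels must be non-empty")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Toy labels, for illustration only (not the DrivenData dataset).
toy_labels = (
    ["functional"] * 6
    + ["non functional"] * 3
    + ["functional needs repair"] * 1
)
majority_label, baseline_accuracy = majority_class_baseline(toy_labels)
print(majority_label, baseline_accuracy)
```

Any model worth keeping should beat this number, which is why it makes a natural starting point for the first iteration.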
As part of my workflow, for each iteration I detail the:
Hypothesis: What belief am I testing with this iteration?
Action: What action (change to the model) am I taking to test the hypothesis?
Result: How did this impact the model?
Insight: Was the hypothesis correct? How does this iteration inform future iterations?
The screenshot below captures the table I used to support the creation of the water pump status predictive model:
(Screenshot: Google Sheets)
To be honest, my first attempt at building this model did not adhere to the theory of layering complexity into the model through constant iteration.
Rather, I iterated infrequently, making many changes between iterations, which reduced my ability to assess the validity of each change.
While the performance of the model was similar, it was overly complex and computationally expensive to iterate on (slowing down my ability to test hypotheses).
Further, I did not have a strong sense of why the model was performing well.
Due to the complexity of my original model, I eventually had to start over to increase my ability to experiment quickly.
Adopting the process described above ultimately enabled a higher-performing, lighter model.
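The per-iteration table (hypothesis, action, result, insight) does not have to live in a spreadsheet. As one possible alternative, it could be kept alongside the modeling code and exported to CSV. The sketch below assumes a hypothetical `IterationRecord` class and `to_csv` helper, and the entry shown is a placeholder rather than a row from my actual table.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class IterationRecord:
    iteration: int
    hypothesis: str  # What belief am I testing?
    action: str      # What change to the model tests it?
    result: str      # How did the change impact the model?
    insight: str     # Was the hypothesis correct? What comes next?


def to_csv(records: List[IterationRecord]) -> str:
    """Serialize the iteration log to CSV text with one row per iteration."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=[f.name for f in fields(IterationRecord)]
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Placeholder entry; the wording is illustrative only.
log = [
    IterationRecord(
        iteration=1,
        hypothesis="(belief being tested)",
        action="(change made to the model)",
        result="(measured impact)",
        insight="(what this suggests for the next iteration)",
    )
]
csv_text = to_csv(log)
print(csv_text)
```

Writing the hypothesis down before running the iteration keeps the log honest: the result is compared against a stated belief rather than rationalized afterward.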
In summary, while it is tempting to quickly add multiple layers of complexity to a model, embracing constant iteration will provide a deeper understanding of the model, generate more creative hypotheses, and result in a simpler, lighter model.