Strong correlations can evaporate when we decide to process the data differently.
High-performing classifiers may just be reflecting our tacit assumptions or poor sampling.
For these and many other reasons, it is dangerous to commit investment based on the early successes of a project.
This is why we also need to talk about our negative results.
If we want our partners in business to trust data science, we need to show them (not just tell them) how the ongoing process of hypothesis rejection allows us to find the right model.
A healthy desire to disprove our own initial results is how we acquire a deep understanding of the data landscape, its pitfalls and surprises as well as the hidden gems.
A recent project of mine nicely illustrates this point.

The hypothesis

A car manufacturer asked me to analyze a dataset of standardized defect codes logged by engineers during testing.
The same codes were being generated automatically by the onboard computer after a car entered service.
My client wanted to know whether the defects found during testing were predictive of defects occurring in service.

The data model

To take a concrete example, imagine a pre-delivery testing program for a certain model of car.
The engine can raise 50 possible defects, the electrical system can raise 50 possible defects, and the hydraulic system can raise 40 possible defects.
The testing lab typically finds only 2 or 3 defects, and fixes them on the spot.
Most new car owners encounter no problems, but sometimes the car breaks down or raises a warning soon after delivery.
We can model the defects experienced by customers within a given horizon (e.g., 3 months) as a vector of frequencies: out of the 140 possible defects, which ones occurred in service and how often?

To look for predictive relationships in the data, I constructed these frequency vectors for each vehicle and summed them over all vehicles that encountered a specific defect during testing (Fig. 1).
I then looked for unusual vectors in this co-occurrence matrix using principal component analysis (and found a few).
But the client was also interested in finding frequent patterns of in-service defects, which could potentially be resolved by improving the company’s testing procedures.
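
The aggregation described above can be sketched as follows. This is a minimal illustration, not the project's code: the function name `cooccurrence_matrix`, the input dictionaries, and the toy data are all hypothetical, and I assume defect codes have already been mapped to integer indices.

```python
import numpy as np

def cooccurrence_matrix(test_defects, service_counts, n_codes):
    """Row i sums the in-service frequency vectors of every vehicle
    that raised defect code i during testing.

    test_defects:   {vehicle_id: [codes raised during testing]}   (assumed layout)
    service_counts: {vehicle_id: {code: count within the horizon}} (assumed layout)
    """
    matrix = np.zeros((n_codes, n_codes), dtype=int)
    for vehicle, codes in test_defects.items():
        # Build this vehicle's in-service frequency vector.
        vector = np.zeros(n_codes, dtype=int)
        for code, count in service_counts.get(vehicle, {}).items():
            vector[code] = count
        # A vehicle with several testing defects is added to several rows,
        # which is why the rows are not fully independent.
        for test_code in set(codes):
            matrix[test_code] += vector
    return matrix

# Hypothetical toy data: 3 vehicles, 4 defect codes.
toy_test = {"car_a": [0, 2], "car_b": [2], "car_c": [1]}
toy_service = {"car_a": {3: 2}, "car_b": {0: 1, 3: 1}, "car_c": {}}
toy_matrix = cooccurrence_matrix(toy_test, toy_service, n_codes=4)
print(toy_matrix)
```

In the toy data, `car_a` raised two testing defects, so its in-service counts appear in both row 0 and row 2.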
Fig. 1: A portion of the co-occurrence matrix showing the most common defects.
Each row of in-service defects (right axis) aggregates the total counts of all vehicles that raised a particular defect during testing (left axis).
A vehicle with multiple testing defects can contribute to more than one row.
Hence, the rows are not fully independent.
The color thresholds are powers of two (blue = 1, green ≥ 2, yellow ≥ 4 …).
The frequency vectors were very sparse, even after aggregating vehicles, and strongly resembled a small corpus of TF-IDF document vectors.
Since I didn’t have very many vectors to compare, I tried using Latent Dirichlet Allocation to look for patterns of co-occurring defects (i.e., LDA topics).
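
For readers who want to try this kind of analysis, here is a rough sketch using scikit-learn's `LatentDirichletAllocation`. The choice of library, the number of topics, the helper name `lda_topic_weights`, and the random toy counts are all assumptions made for illustration; none of them come from the project.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def lda_topic_weights(count_matrix, n_topics=3, seed=0):
    """Fit LDA to rows of defect counts and return each row's topic mixture."""
    counts = np.asarray(count_matrix)
    # Rows with no in-service defects carry no information, so drop them.
    nonempty = counts.sum(axis=1) > 0
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    weights = lda.fit_transform(counts[nonempty])  # one topic mixture per row
    return nonempty, weights, lda.components_

# Hypothetical sparse count data standing in for the co-occurrence matrix.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(0.3, size=(30, 20))
mask, weights, topics = lda_topic_weights(toy_counts, n_topics=3)
print(weights[:5].round(2))
```

Each row of `weights` is a point in topic space. Rows that sit far from the bulk of the others are the candidates for further inspection.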

Initial results

While 90% of the vectors were very similar to each other and to the average defect distribution, the remaining 10% formed four distinct clusters in LDA topic space.
My client was excited: had we found frequent patterns of customer problems, perhaps explainable by a flawed process for testing some vehicle subsystems? However, I was not ready to validate the result, because I knew that the aggregated frequency vectors in the dataset were not fully independent.

Proving myself wrong

I spent another week improving data pre-processing, simulating defect frequency vectors with a Poisson process to estimate the expected number of spurious correlations (Fig. 2), and examining the individual cars that went into the clustered vectors.
One of the clusters shrank to just two members, and two others turned out to be composed of vectors that shared data from a single highly defective vehicle.
The one remaining cluster only appeared when I focused my analysis on rare in-service defects, and while it contained several independent vectors, the similarity between them was weak.
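
The simulation step can be sketched along these lines. This is one possible null model, assumed for illustration: each cell is drawn independently from a Poisson distribution whose rate is set by the observed row and column totals. The function names and toy data are hypothetical, and the input is assumed to contain at least one nonzero count.

```python
import numpy as np

def abs_row_correlations(matrix):
    """Absolute Pearson correlation for every pair of non-constant rows."""
    m = np.asarray(matrix, dtype=float)
    m = m[m.std(axis=1) > 0]  # correlation is undefined for constant rows
    if m.shape[0] < 2:
        return np.empty(0)
    corr = np.corrcoef(m)
    upper = np.triu_indices_from(corr, k=1)  # each pair once, no diagonal
    return np.abs(corr[upper])

def simulate_null_correlations(observed, n_sims=100, seed=0):
    """Correlations expected when cells are independent Poisson draws."""
    obs = np.asarray(observed, dtype=float)
    # Assumed null model: rate = row total * column total / grand total.
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    rng = np.random.default_rng(seed)
    sims = [abs_row_correlations(rng.poisson(expected)) for _ in range(n_sims)]
    return np.concatenate(sims)

# Hypothetical stand-in for the real co-occurrence matrix.
toy = np.random.default_rng(1).poisson(0.5, size=(40, 30))
obs_r = abs_row_correlations(toy)
null_r = simulate_null_correlations(toy, n_sims=20)
threshold = 0.5
print((obs_r > threshold).mean(), (null_r > threshold).mean())
```

If the observed matrix shows about the same share of strong correlations as the simulated one, the apparent patterns are no more than chance would produce.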
Fig. 2: Distribution of correlation coefficients (absolute value) between the vectors of the co-occurrence matrix (blue).
Correlations in a simulated matrix where defects are generated by a Poisson process (orange).
The similarity of the two distributions undermines the hypothesis that we can find frequent patterns.
The small excess of strong correlations in the observed data arises because some rows of the matrix are not fully independent.
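
One way to find rows that are not independent is to list the pairs of rows that draw on the same vehicle. The sketch below uses only the standard library. The function name `rows_sharing_vehicles` and the input layout are hypothetical.

```python
from itertools import combinations

def rows_sharing_vehicles(test_defects):
    """Map each pair of testing-defect rows to the vehicles they share.

    test_defects: {vehicle_id: [codes raised during testing]} (assumed layout)
    """
    # Collect the set of vehicles behind each row of the matrix.
    members = {}
    for vehicle, codes in test_defects.items():
        for code in set(codes):
            members.setdefault(code, set()).add(vehicle)
    # Keep only the row pairs with at least one vehicle in common.
    shared = {}
    for a, b in combinations(sorted(members), 2):
        common = members[a] & members[b]
        if common:
            shared[(a, b)] = common
    return shared

# Hypothetical toy data: car_c raised three different defects in testing.
toy_fleet = {"car_a": [0, 2], "car_b": [2], "car_c": [1, 2, 0]}
overlaps = rows_sharing_vehicles(toy_fleet)
for pair, vehicles in overlaps.items():
    print(pair, sorted(vehicles))
```

A single vehicle that appears in many pairs, like `car_c` here, is the kind of highly defective car that can create an apparent cluster on its own.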
In the end, the most important outcome of this analysis was a rejection of the initial hypothesis: while we discovered a handful of unusually reliable or defective vehicles, there was no discernible relationship between the defects encountered during testing and the occurrence of in-service problems.
From a business standpoint, this result is encouraging, because it suggests that the current test procedures are very good.

Publicizing the negative result

By quickly disproving the hypothesis of frequent patterns, I could tell my client to look elsewhere for value generators.
Time, attention, and analysis are expensive, so he was still pleased.
Now he could make correct choices going forward.
Disproving a hypothesis is often easier than proving it, and by “failing fast” in testing the ideas of domain experts, we can deliver insights quickly and build up trust.
Any data science department will encounter many cases like this one.
Business units bring their ideas for generating value, and it is our responsibility to filter them quickly so that only viable concepts advance to the next stage.
Furthermore, we have a duty to question our own ideas and results.
We must try our best to prove ourselves wrong, no matter how elegant or attractive our ideas seem.
In short, our negative results show that we are not just application developers or machine learning hackers — we are hypothesis testers, and this activity is what puts the ‘science’ in data science.