Top 10 Statistics Mistakes Made by Data Scientists

The model you built looked great in R&D but performs horribly in production.

The model you said will do wonders is causing really bad business outcomes, potentially costing the company $m+.

It's so important that all the remaining mistakes, bar the last one, focus on it.

Solution: Make sure you've run your model in realistic out-sample conditions and understand when it will perform well and when it won't.
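
A minimal sketch of such a check, assuming scikit-learn and a made-up synthetic dataset (the data, the 50/50 split and the model settings are illustrative assumptions, not the setup behind the numbers in this article). It fits both models on one half of the data and reports the MSE on the half they were trained on and on the half they never saw.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 0.0]) + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

models = {
    "ols": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        # Error on the data the model was fitted on.
        "in_sample": mean_squared_error(y_train, model.predict(X_train)),
        # Error on data the model has never seen.
        "out_sample": mean_squared_error(y_test, model.predict(X_test)),
    }
    print(name, results[name])
```

If the in-sample error is much lower than the out-sample error, as it typically is for the random forest here, that gap is the warning sign to look for before going live.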

Example: In-sample, the random forest does a lot better than linear regression, with MSE 0.048 vs. OLS MSE 0.183, but out-sample it does a lot worse than linear regression, with MSE 0.259.


The random forest overtrained and would not perform well live in production!

You probably know that powerful ML models can overtrain.

Overtraining means it performs well in-sample but badly out-sample.

So you need to be aware of having training data leak into test data.

If you are not careful, any time you do feature engineering or cross-validation, train data can creep into test data and inflate model performance.

Solution: make sure you have a true test set free of any leakage from the training set.

Especially beware of any time-dependent relationships that could occur in production use.

Example: This happens a lot.

Preprocessing is applied to the full dataset BEFORE it is split into train and test, meaning you do not have a true test set.

Preprocessing needs to be fitted on the training set only, AFTER the data is split into train and test sets, and then applied to the test set, to make it a true test set.

The MSE between the two methods (mixed out-sample CV MSE 0.187 vs. true out-sample CV MSE 0.181) is not all that different in this case because the distributional properties of train and test are not that different, but that might not always be the case.
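
Here is a rough sketch of the two approaches, assuming scikit-learn, a synthetic dataset and a ridge regression (all illustrative assumptions, not the code behind the numbers above). The "mixed" version scales the full dataset before cross-validation; the "true" version puts the scaler in a pipeline so it is refitted on the training folds only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): unscaled features with a linear signal.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Mixed: the scaler sees every row, including rows later used for testing.
X_scaled_all = StandardScaler().fit_transform(X)
mixed_scores = cross_val_score(
    Ridge(alpha=1.0), X_scaled_all, y, cv=cv, scoring="neg_mean_squared_error"
)

# True: the pipeline refits the scaler on each training fold only.
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
true_scores = cross_val_score(
    pipeline, X, y, cv=cv, scoring="neg_mean_squared_error"
)

mixed_mse = -mixed_scores.mean()
true_mse = -true_scores.mean()
print(f"mixed out-sample CV mse: {mixed_mse:.3f}")
print(f"true out-sample CV mse:  {true_mse:.3f}")
```

With a simple scaler and similar train and test distributions the two numbers will usually be close; the pipeline version is still the safer habit because it keeps working when the preprocessing gets more elaborate.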

You were taught cross-validation is all you need.

sklearn even provides you some nice convenience functions so you think you have checked all the boxes.

But most cross-validation methods do random sampling, so you might end up with training data in your test set, which inflates performance.

Solution: generate test data such that it accurately reflects data on which you would make predictions in live production use.

Especially with time series and panel data you likely will have to generate custom cross-validation data or do roll-forward testing.
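
One possible way to build such roll-forward splits for panel data is sketched below, assuming NumPy. The helper `roll_forward_splits` and the toy panel are hypothetical names made up for illustration. It splits on dates rather than rows, so all entities observed on the same date land on the same side of the split and every test date is later than every training date.

```python
import numpy as np

def roll_forward_splits(dates, n_splits=3):
    """Yield (train_idx, test_idx) with all test dates after all train dates."""
    dates = np.asarray(dates)
    unique_dates = np.unique(dates)  # np.unique returns sorted values
    blocks = np.array_split(unique_dates, n_splits + 1)
    for k in range(1, n_splits + 1):
        train_dates = np.concatenate(blocks[:k])  # everything up to block k
        test_dates = blocks[k]                    # the next block in time
        train_idx = np.flatnonzero(np.isin(dates, train_dates))
        test_idx = np.flatnonzero(np.isin(dates, test_dates))
        yield train_idx, test_idx

# Toy panel (assumption): two entities observed on the same 8 dates.
dates = np.repeat(np.arange(8), 2)   # 0, 0, 1, 1, ..., 7, 7
entities = np.tile(["A", "B"], 8)    # A, B, A, B, ...

splits = list(roll_forward_splits(dates, n_splits=3))
for train_idx, test_idx in splits:
    print("train dates:", np.unique(dates[train_idx]),
          "test dates:", np.unique(dates[test_idx]))
```

For a single time series, scikit-learn's `TimeSeriesSplit` does something similar on row order; the date-based version here is meant for the case where several rows share a date.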

Example: here you have panel data for two different entities (e.g. companies) which are cross-sectionally highly correlated.

If you randomly split data you make accurate predictions using data you did not actually have available during test, overstating model performance.

You think you avoided mistake #5 by using cross-validation and found the random forest performs a lot better than linear regression in cross-validation.

But when you run a roll-forward out-sample test, which prevents future data from leaking into test, it performs a lot worse again! (Random forest MSE goes from 0.047 to 0.211, higher than linear regression!)

When you run a model in production, it gets fed with data that is available when you run the model.

That data might be different than what you assumed to be available in training.

For example, the data might be published with a delay, so by the time you run the model other inputs have changed and you are making predictions with wrong data, or your true y variable is incorrect.

Solution: do a rolling out-sample forward test.

If you had used this model in production, what would your training data look like, i.e. what data would you have had to make predictions? That's the training data you use to make a true out-sample production test.
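
A small sketch of that question in code, using only the standard library. The records, field names and helper functions are hypothetical and only illustrate the idea: each value describes a period but only became known on its publication date, so at the point of decision you may only use what had been published by then.

```python
from datetime import date

# Hypothetical records: each value describes `period_end` but was only
# published (and therefore usable) on `published`.
records = [
    {"period_end": date(2020, 1, 31), "published": date(2020, 2, 15), "value": 1.0},
    {"period_end": date(2020, 2, 29), "published": date(2020, 3, 16), "value": 1.2},
    {"period_end": date(2020, 3, 31), "published": date(2020, 4, 15), "value": 0.9},
]

def available_as_of(records, decision_date):
    """Keep only the records that had been published by the decision date."""
    return [r for r in records if r["published"] <= decision_date]

def latest_known_value(records, decision_date):
    """Most recent value you could actually have used on the decision date."""
    known = available_as_of(records, decision_date)
    if not known:
        return None
    return max(known, key=lambda r: r["period_end"])["value"]

decision_date = date(2020, 3, 31)

# Naive lookup: uses the March figure, which was not yet published.
naive = max(
    (r for r in records if r["period_end"] <= decision_date),
    key=lambda r: r["period_end"],
)["value"]

# Point-in-time lookup: falls back to the February figure.
point_in_time = latest_known_value(records, decision_date)
print(naive, point_in_time)
```

Building the training and test data with the point-in-time lookup is what keeps the roll-forward test honest about publication delays.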

Furthermore, if you acted on the prediction, what result would that generate at the point of decision?

The more time you spend on a dataset, the more likely you are to overtrain it.

You keep tinkering with features and optimizing model parameters.

You used cross-validation so everything must be good.

Solution: After you have finished building the model, try to find another “version” of the dataset that can serve as a surrogate for a true out-sample dataset.

If you are a manager, deliberately withhold data so that it does not get used for training.

Example: Applying the models that were trained on dataset 1 to dataset 2 shows the MSEs more than doubled.

Are they still acceptable? This is a judgement call, but your results from #4 might help you decide.
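
If you want to withhold data up front, one possible approach is a deterministic, hash-based assignment, sketched below with the standard library only. The helper `is_withheld`, the salt and the record ids are all made up for illustration. Because the assignment depends only on the record id, a record that is withheld today stays withheld after every data refresh.

```python
import hashlib

def is_withheld(record_id, fraction=0.2, salt="holdout-v1"):
    """Deterministically assign a record to the withheld set."""
    digest = hashlib.sha256(f"{salt}:{record_id}".encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < fraction

# Hypothetical record ids.
record_ids = [f"obs-{i}" for i in range(1000)]

development = [r for r in record_ids if not is_withheld(r)]  # free to tinker with
withheld = [r for r in record_ids if is_withheld(r)]         # touch once, at the end

print(len(development), len(withheld))
```

The withheld part is only useful if it is scored once, after model building is finished; every additional look turns it back into training data.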

Counterintuitively, often the best way to get started analyzing data is by working on a representative sample of the data.

That allows you to familiarize yourself with the data and build the data pipeline without waiting for data processing and model training.

But data scientists seem not to like that – more data is better.

Solution: start working with a small representative sample and see if you can get something useful out of it.

Give it back to the end user: can they use it? Does it solve a real pain point? If not, the problem is likely not that you have too little data but your approach.
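
What "representative" means depends on your data; one simple reading is a stratified sample that keeps the mix of an important grouping intact. A minimal sketch with the standard library, where `stratified_sample`, the `segment` field and the toy rows are hypothetical:

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every group so the group mix is preserved."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one row per group
        sample.extend(rng.sample(members, k))
    return sample

# Toy data (assumption): 250 institutional and 750 retail rows.
rows = [
    {"id": i, "segment": "retail" if i % 4 else "institutional"}
    for i in range(1000)
]

small = stratified_sample(rows, key=lambda r: r["segment"], fraction=0.1)
print(len(small))
```

Developing the pipeline on `small` and only then switching to the full data keeps iteration fast without changing the composition of what you are looking at.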

Bio: Norman Niemer is the Chief Data Scientist at a large asset manager where he delivers data-driven investment insights.

He holds an MS in Financial Engineering from Columbia University and a BS in Banking and Finance from Cass Business School (London).


Reposted with permission.
