Machine Learning — Perfection always starts with mistakes

As we grow as Data Scientists, we build on the experience of our mistakes and learn to avoid them, or to identify and remedy them early.

Mistake Types

The most common types of mistakes in the ML pipeline relate to one or more of the following areas.

Data Preparation — Data Cleansing

It is axiomatic to say that 'dirty data' is one of the biggest barriers Data Scientists face. Data Cleansing is the most time-consuming part of a ML project, taking around 60% of the overall time, preceded by Data Ingestion at around 20% — a remarkable total of 80% is spent in the initial phase of the project!

There is a joke that claims: 80% of machine learning is cleaning the data, and the other 20% is complaining about cleaning the data.

Treating missing values is one of the most important tasks of data cleansing, and as such it can easily lead to mistakes.

We need to examine the columns with the missing values and see how they relate to the rest of the data set, especially the target values.

A common technique is to use the mean/median/mode of the existing values but it could be the case that this is not the right metric and we need to come up with something else.

Additionally, when it comes to classification we need to consider the class structure of the data set as we can introduce a new ‘Undefined’ category, or another possibility is to use a ML algorithm to predict the missing value.

Finally, we can make a note of these null values and choose an algorithm that can cater for them.

Any mistake here can really distort the final results later on, so it is advisable to split the process into individual steps, and perhaps apply the strategy or factory design pattern in our code so we can interchange between these filling methodologies.
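A minimal sketch of that idea in Python: a registry of interchangeable filling strategies (mean, median, mode, or a new 'Undefined' category), so swapping one imputation method for another is a one-word change. The function and strategy names here are illustrative, not from a specific library.

```python
from statistics import mean, median, mode

# Interchangeable filling strategies (a minimal strategy pattern);
# each maps a column's observed values to a single fill value.
FILL_STRATEGIES = {
    "mean": mean,
    "median": median,
    "mode": mode,
    "undefined": lambda observed: "Undefined",  # new category for classification
}

def impute(column, strategy="mean"):
    """Replace None entries in `column` using the chosen strategy."""
    observed = [v for v in column if v is not None]
    fill = FILL_STRATEGIES[strategy](observed)
    return [fill if v is None else v for v in column]

ages = [22, None, 35, 41, None, 35]
print(impute(ages, "median"))   # fills missing entries with the observed median
print(impute(ages, "undefined"))
```

Because each strategy is just a function in a dictionary, a project-specific metric can be registered alongside the standard ones without touching the imputation logic.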

Data Preparation — Feature Engineering

Feature engineering is the superset of feature extraction, construction and selection.

Here, Data Scientists use both business experience and data driven insights to identify which columns correlate to the target.

The feature importance can be derived by allocating scores and then ranking them: those features with the highest scores can be selected for inclusion in the training dataset, whereas those remaining can be ignored.

We can also use this information to construct new features or, conversely, to reduce dimensionality.
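The score-and-rank approach can be sketched in pure Python, using the absolute Pearson correlation with the target as a stand-in importance score; the feature names and data below are made up for illustration.

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def rank_features(features, target):
    """Score each feature by |correlation with the target|, rank descending."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up example: two informative columns and one noisy one.
features = {
    "age":    [25, 32, 47, 51, 62],
    "noise":  [3, 1, 4, 1, 5],
    "income": [30, 42, 61, 70, 85],
}
target = [0, 0, 1, 1, 1]
ranked = rank_features(features, target)
top_2 = [name for name, score in ranked[:2]]  # keep top-k for training
```

In practice the score could come from mutual information, a tree-based importance, or business knowledge; the ranking-and-cutoff step stays the same.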

It is key to choose the right features, as having better features means:

- more flexibility to choose a less complex model
- more flexibility to choose less optimal model parameters
- better results

However simple or complex this might be, it is important to remember that feature selection directly impacts model selection: we do not want to inadvertently introduce bias into our models, which can result in overfitting.

Any mistake in this phase has a direct impact on model accuracy too.

Here it is prudent to keep a record of all the assumptions we are making, so we can go back and revisit them if a mistake is encountered.

Having extensive documentation will help throughout the project especially when it comes to validation and deployment of the model.


Data Segregation — Sampling

Primary errors in this area relate to using a single, or a limited number of, samples, which can introduce measurable biases when training and testing the model.

Another type of mistake is not selecting a representative sample from the dataset, so that the proportions of characteristics/traits in the population are not preserved.



For example, if the population has 35% black items and 65% white items, then our sample should reflect these percentages.
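This is stratified sampling: draw the same fraction from each class so the sample mirrors the population. A minimal sketch with the 35/65 split from the example (the function name is ours, not from a library):

```python
import random
from collections import Counter

def stratified_sample(items, labels, fraction, seed=0):
    """Draw a sample whose class proportions mirror the population's."""
    rng = random.Random(seed)
    by_class = {}
    for item, label in zip(items, labels):
        by_class.setdefault(label, []).append(item)
    sample = []
    for label, group in by_class.items():
        k = round(len(group) * fraction)  # same fraction from every class
        sample.extend((item, label) for item in rng.sample(group, k))
    return sample

# Population with 35% black and 65% white items, as in the text.
labels = ["black"] * 35 + ["white"] * 65
items = list(range(100))
sample = stratified_sample(items, labels, fraction=0.2)
print(Counter(label for _, label in sample))  # 7 black, 13 white: 35%/65% preserved
```

Libraries such as scikit-learn offer the same idea built in (e.g. a `stratify` option when splitting data), but the principle is exactly this per-class draw.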

Candidate Model Evaluation

A usual mistake in this pipeline step is that Data Scientists do not spend enough time evaluating the model, but jump into using it straight away.

Model evaluation is really important to ensure there are no biases present.

This step goes hand-in-hand with the sampling step: the validation procedure needs to be repeated more than once to yield better results.
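One standard way to repeat the validation procedure is k-fold cross-validation: partition the indices into k folds and evaluate once per fold, then look at the spread of scores rather than a single number. A minimal sketch of the index splitting (the model-scoring line is a hypothetical placeholder):

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # distribute the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

scores = []
for train_idx, test_idx in k_fold_splits(10, k=5):
    # hypothetical: fit on train_idx, score on test_idx, e.g.
    # model.fit(X[train_idx], y[train_idx]); scores.append(model.score(X[test_idx], y[test_idx]))
    scores.append(len(test_idx) / 10)  # placeholder metric
```

Shuffling the indices first (or stratifying them, as in the sampling section) makes the folds representative; each data point appears in exactly one test fold.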

And last but not least: Choosing the Right Model

Choosing the right model for a ML project is a process that requires special attention: there is a myriad of models that can work with our data, but that does not necessarily mean they are suitable for the problem we are trying to solve.

The main mistakes in the model selection process relate to choosing a model because of:

- its popularity amongst the data science community
- its accuracy (as the only criterion)
- its speed of returning results (as the only criterion)
- its ease of use compared to other options

Finale

Forget the Mistake — Remember the Lesson!

Mistakes are an integral part of the learning process — we should embrace them, as they are the engine that drives us onwards and upwards.

The more mindful we become about the ML pipeline the less likely we are to make mistakes that can put the whole project at risk.

We can also enhance our learning by thoroughly working through a ML case study and observing how it addresses the various steps of the pipeline — and, of course: Practice makes perfect!

Thanks for reading!
