3 steps to a clean dataset with Pandas

Only drop it if you’re quite sure it won’t be helpful.

Here’s how to do all of those things in Pandas:

(2) Handling missing values

We’ve already dropped the feature variables with a high percentage of missing values.
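As a reminder of what that dropping step might look like, here is a minimal sketch. The toy DataFrame, its column names, and the 60% cut-off are all assumptions made up for illustration, not values from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are invented for this sketch.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38],
    "income": [50000, np.nan, 62000, 58000, 71000],
    "fax_number": [np.nan, np.nan, np.nan, "555-0100", np.nan],
})

# Fraction of missing values in each column (isna() gives booleans,
# and the mean of booleans is the share of True values).
missing_fraction = df.isna().mean()

# Assumed cut-off: drop any column that is more than 60% empty.
threshold = 0.6
cols_to_drop = missing_fraction[missing_fraction > threshold].index

df_reduced = df.drop(columns=cols_to_drop)
print(df_reduced.columns.tolist())
```

The right threshold depends on the dataset, so it is worth printing missing_fraction and looking at it before deciding what to drop.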

Now we want to handle the feature variables that we do need but that still have missing values.

Again, we have a few options:

- Fill in the missing rows with an arbitrary value
- Fill in the missing rows with a value computed from the data’s statistics
- Ignore missing rows

The first one can be done if you know what a good default value should be.

But if you can compute a value from the data’s statistics, such as the mean or median of the column, that is often preferable, since it at least has some support from the data.
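The two fill options could be sketched like this. The columns, the "unknown" default, and the choice of the median are assumptions for illustration; the mean or the mode may suit your data better.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both a text and a numeric column.
df = pd.DataFrame({
    "city": ["Austin", None, "Boston", "Denver"],
    "age": [25.0, np.nan, 41.0, 30.0],
})

# Option 1: fill with an arbitrary default value.
df["city"] = df["city"].fillna("unknown")

# Option 2: fill with a value computed from the data, here the column median.
# median() skips missing values by default.
age_median = df["age"].median()
df["age"] = df["age"].fillna(age_median)

print(df)
```

The median is used here because it is less sensitive to outliers than the mean, but that is a design choice, not a rule.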

The last option can be taken if we have a large enough dataset to afford throwing away some of the rows.

However, before you do this, take a quick look at the data to make sure those data points aren’t critically important.
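A minimal sketch of that last option, including the quick look beforehand. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; customer 5 has a missing age but a very large spend.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [25.0, np.nan, 41.0, 30.0, np.nan],
    "spend": [120.0, 80.0, np.nan, 95.0, 4300.0],
})

# Columns that must be present for a row to be usable (an assumption here).
required = ["age", "spend"]

# Look at the rows we are about to lose before throwing them away.
rows_with_gaps = df[df[required].isna().any(axis=1)]
print(rows_with_gaps)

# Keep only the rows where every required column has a value.
df_complete = df.dropna(subset=required)
print(len(df), "->", len(df_complete), "rows")
```

In this toy example the inspection step shows that one of the rows to be dropped holds by far the largest spend value, which is exactly the kind of data point you may not want to discard.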

(3) Formatting the data

When datasets are collected, the data will often be entered in by human users as plain text.

This can cause complications with the data format.

For example, there are many ways to enter the name of the state of California: CA, C.A, california, Cali; these will all need to be standardised into one uniform format.
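One way to approach this is to normalise the raw strings first and then map the known variants onto a single code. The sketch below assumes a small hand-written lookup table (state_map is hypothetical); a real dataset would need a fuller mapping.

```python
import pandas as pd

# Hypothetical free-text entries for a "state" column.
states = pd.Series(["CA", "C.A", "california", "Cali", " ca ", "NY"])

# Step 1: remove surrounding whitespace, lowercase, and strip periods
# so that "C.A", "CA" and " ca " all collapse to "ca".
normalised = (
    states.str.strip()
          .str.lower()
          .str.replace(".", "", regex=False)
)

# Step 2: map each known variant to one uniform code.
# This lookup table is an assumption made up for the example.
state_map = {"ca": "CA", "california": "CA", "cali": "CA", "ny": "NY"}
standardised = normalised.map(state_map)

print(standardised.tolist())
```

Any value that is not in the lookup table comes out of map() as NaN, which is a convenient way to spot variants you have not handled yet.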

In addition, there may be cases where the data is continuous and we want to make it discrete or vice versa.

- Standardising data format including acronyms, capitalisation, and style
- Discretising continuous data, or vice versa

Let’s do that in Pandas:
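For the discretising case, a minimal sketch using pd.cut is shown below. The bin edges and labels are assumptions chosen for illustration and should come from your own problem.

```python
import pandas as pd

# Hypothetical continuous feature.
ages = pd.Series([5, 17, 18, 34, 65, 80])

# Assumed bin edges and labels; pick ones that make sense for your data.
bins = [0, 18, 65, 120]
labels = ["child", "adult", "senior"]

# right=False makes each bin include its left edge and exclude its right one:
# [0, 18), [18, 65), [65, 120)
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

print(age_group.tolist())
```

pd.cut uses fixed edges that you choose. If you would rather have bins with roughly equal numbers of rows, pd.qcut is the quantile-based counterpart.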
