Machine Learning in R
Guillermo Martínez Espina
Feb 19

Data Preprocessing

Image: https://www.[…]com/photos/mikemacmarketing/30212411048

During the past weeks I have been working with Machine Learning in R and Python and also taking several courses.
One thing I have noticed that all my programs have in common is the need to preprocess the data before applying Machine Learning models.
Most of the time, data preprocessing is divided into the following steps:

1. Importing the dataset.
2. Completing missing data.
3. Encoding categorical data.
4. Splitting the dataset.
5. Feature scaling.

Importing the Dataset

There are several ways to import a dataset.
The simplest one is importing the dataset from a .csv file.
To do that, first set your working directory:

    #setwd("~/Desktop/ML/Project1")
    setwd("<directory_where_your_dataset_is_located>")

Once you have established the working directory, import the dataset:

    dataset = read.csv('<filename>.csv')

The command read.csv('filename') accepts several optional parameters; which ones you need depends on how your dataset is arranged in the .csv file.
You can set the sep parameter to indicate the separator used in your file.
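
To see why the separator matters, here is a minimal sketch that writes a small semicolon-separated file to a temporary location and reads it twice. The file contents and the variable names (tmp, wrong, right) are made up for illustration and are not part of the original dataset.

```r
# Write a tiny semicolon-separated file to a temporary location (illustrative data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("Age;Income;Graduate",
             "25;30000;yes",
             "40;52000;no"), tmp)

# read.csv assumes a comma by default, so every line lands in a single column.
wrong <- read.csv(tmp)

# Passing sep = ";" splits each line into the three expected columns.
right <- read.csv(tmp, sep = ";")

ncol(wrong)   # 1
ncol(right)   # 3
names(right)  # "Age" "Income" "Graduate"
```

If the number of columns after importing is not what you expect, the separator is a good first thing to check.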
For instance, for a file whose values are separated by semicolons:

    # sep = ';' indicates that the values are separated by ;
    dataset = read.csv('<filename>.csv', sep = ';')

Completing Missing Data

Completing missing data is optional.
If your dataset is complete, you will not have to do this part.
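
One way to find out whether a dataset is complete is to count the missing values per column. The sketch below assumes a small made-up data frame with the Age, Income and Graduate columns; the values are illustrative only.

```r
# Illustrative data frame with two missing cells (NA).
dataset <- data.frame(
  Age = c(25, NA, 40, 31),
  Income = c(30000, 45000, NA, 52000),
  Graduate = c("yes", "no", "yes", "no")
)

# TRUE if there is at least one missing value anywhere in the data frame.
has_missing <- anyNA(dataset)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(dataset))

# Rows that have no missing value at all.
complete_rows <- dataset[complete.cases(dataset), ]

has_missing          # TRUE
missing_per_column   # Age 1, Income 1, Graduate 0
nrow(complete_rows)  # 2
```

If has_missing is FALSE, this step can be skipped.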
But sometimes you will find datasets with some missing cells. In that case, you can do one of two things:

- Remove the complete row (not recommended, since you could delete crucial information).
- Fill in the missing information with the mean of the column.

Take the following incomplete dataset:

[Image: the incomplete dataset]

As you can see, there are some missing cells, one in the Age column and another one in the Income column.
To fill those missing cells with the mean of each column, do the following:

    dataset$Age = ifelse(is.na(dataset$Age),
                         ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                         dataset$Age)
    dataset$Income = ifelse(is.na(dataset$Income),
                            ave(dataset$Income, FUN = function(x) mean(x, na.rm = TRUE)),
                            dataset$Income)

For each of the two columns, the code checks every cell for a missing value (NA).
If a cell is missing, it is replaced with the mean of the column, computed from the values that are present.

The output is the following:

[Image: the dataset with the missing cells filled in]

Now that the data is complete, we can go to the next step.
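
When there are many numeric columns, repeating the same line for each one gets tedious. Here is a minimal sketch of a helper that applies the same mean replacement to every numeric column. The function name impute_mean and the sample values are hypothetical and only serve to illustrate the idea.

```r
# Replace NA with the column mean in every numeric column of a data frame.
impute_mean <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      # Mean of the values that are present.
      col_mean <- mean(df[[col]], na.rm = TRUE)
      df[[col]][is.na(df[[col]])] <- col_mean
    }
  }
  df
}

# Illustrative data: one missing Age and one missing Income.
dataset <- data.frame(Age = c(20, NA, 40), Income = c(1000, 3000, NA))
filled <- impute_mean(dataset)

filled$Age     # 20 30 40
filled$Income  # 1000 3000 2000
```

Non-numeric columns, such as a yes/no column, are left untouched by this helper.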

Encoding Categorical Data

This step is also optional.
Your dataset might already come with its categorical data encoded.
In that case you won't need to do this.
In our case, we have the Graduate column, which has two possible values, either yes or no.
To be able to work with this data we have to encode it, which means changing the labels to numbers.
Doing this in R is really simple:

    dataset$Graduate = factor(dataset$Graduate, levels = c('yes', 'no'), labels = c(1, 0))

The Graduate column is now a factor whose levels are labeled 1 and 0.
The output is the following:

[Image: the dataset with the Graduate column encoded]

Splitting the Dataset

This part is mandatory and one of the most important parts when working with Machine Learning models.
Splitting the dataset means dividing the whole dataset into two parts, the training set and the test set.
When you want a model to solve or predict a specific thing, you first train the model on the training set and then use the test set to check whether the model makes correct predictions.
Normally the proportion is 80% training set and 20% test set, but it can vary depending on your model.
We will split the dataset with that proportion.
You first have to install a package called caTools:

    install.packages('caTools')

Once it is installed, tell R that you will use that library:

    library(caTools)

The next step is setting a seed, which makes the random split reproducible, and then splitting the dataset.
To do so, type the following:

    # Sets a seed; you can type any number, not just 123
    set.seed(123)
    # SplitRatio indicates the proportion of the data that goes to the training set
    split = sample.split(dataset$Purchased, SplitRatio = 0.8)
    training_set = subset(dataset, split == TRUE)
    test_set = subset(dataset, split == FALSE)

Now that the data is split, we can proceed to the last step.
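
If you cannot install caTools, a similar 80/20 split can be sketched with base R only. This is an alternative technique, not the sample.split call used above: according to its documentation, sample.split also tries to preserve the ratio of the labels in the column you pass, which plain sample() does not do. The data frame below is made up for illustration.

```r
# Illustrative data frame with 20 rows.
dataset <- data.frame(
  Age = 21:40,
  Purchased = rep(c(0, 1), times = 10)
)

# Fix the seed so the random split is reproducible.
set.seed(123)

# Pick 80% of the row indices at random for the training set.
n <- nrow(dataset)
train_idx <- sample(n, size = floor(0.8 * n))

training_set <- dataset[train_idx, ]
test_set <- dataset[-train_idx, ]

nrow(training_set)  # 16
nrow(test_set)      # 4
```

Every row ends up in exactly one of the two sets, so the test set contains only data the model has not seen during training.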

Feature Scaling

This last step is also not always necessary.
In the dataset there are some values that are not on the same scale; for example, Age and Income have very different scales.
Many Machine Learning models rely on the Euclidean distance between two points. When the scales differ, the feature with the larger values dominates that distance, which can cause problems in your model.
Some models handle this already, so you do not have to do it yourself, but other models require you to scale your features beforehand.
To scale our data, we run the following code:

    training_set[, 1:2] = scale(training_set[, 1:2])
    test_set[, 1:2] = scale(test_set[, 1:2])

As you can see, R has a function that scales the selected columns; in our case we are scaling all the rows of the first two columns.
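
The code above scales the test set with its own mean and standard deviation. A commonly used alternative is to reuse the centering and scaling values of the training set, so that both sets go through exactly the same transformation. A minimal sketch of that alternative, assuming two small made-up data frames:

```r
# Illustrative training and test sets (values are made up).
training_set <- data.frame(Age = c(20, 30, 40), Income = c(1000, 2000, 3000))
test_set <- data.frame(Age = c(30, 50), Income = c(2000, 4000))

# scale() stores the column means and standard deviations as attributes.
scaled_train <- scale(training_set[, 1:2])
train_center <- attr(scaled_train, "scaled:center")
train_scale <- attr(scaled_train, "scaled:scale")

training_set[, 1:2] <- scaled_train

# Apply the training set's mean and standard deviation to the test set.
test_set[, 1:2] <- scale(test_set[, 1:2],
                         center = train_center,
                         scale = train_scale)

test_set$Age     # 0 2
test_set$Income  # 0 2
```

Which variant fits better depends on your model and on how you plan to evaluate it.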
What about the encoded categorical data? Do we need to scale it as well? Some people say that it is useful to scale encoded categorical data, while others say it is not necessary.
In my experience it doesn't matter that much; it is up to you.

Conclusion

After you have done all these data preprocessing steps a few times, you will notice that some of them can be omitted if your data is well prepared from the very beginning.
Why are all these steps important? One of the most crucial parts of Machine Learning is having a well-prepared and trustworthy dataset. Preparing your information in the right way is a step toward a good Machine Learning model.