Feature Engineering in Python: Rare values

First, plot the distribution of the target y.

If it is roughly normal, the standard deviation is a reasonable measure of the error (spread).

Otherwise, you can use the interquartile range.
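As a rough illustration of the two spread measures mentioned above, the sketch below computes the standard deviation and the interquartile range of a target. The array `y` is hypothetical toy data, not the dataset used in this post.

```python
import numpy as np

# Hypothetical target values (seconds); not the real dataset.
y = np.array([88.0, 90.0, 92.0, 95.0, 97.0, 100.0, 104.0, 110.0, 130.0, 160.0])

# Standard deviation: a sensible spread measure when y is roughly normal.
y_std = y.std(ddof=1)

# Interquartile range: more robust when y is skewed or has outliers.
q1, q3 = np.percentile(y, [25, 75])
y_iqr = q3 - q1

print('std: {:.2f}'.format(y_std))
print('IQR: {:.2f}'.format(y_iqr))
```

With a long right tail like the one in this toy array, the standard deviation is pulled up by the largest values, while the interquartile range only reflects the middle half of the data.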

Remove rare labels: grouping under a new label

One way of tackling rare or infrequent values is to group them under an umbrella category called ‘Rare’ or ‘Other’.

Now I will group infrequent labels in each of the variables into one category called ‘Rare’ because I want to show you how this affects the performance of machine learning algorithms.

# grouping rare labels into one category
# I will replace all the labels that appear in less than 10%
# of the observations by the new label 'rare'

# first I calculate the frequency of the categories
# or in other words, the % of cars in each category for
# the variable X1
temp_df = pd.Series(data['X1'].value_counts() / total_cars)
temp_df.sort_values(ascending=False)
temp_df

# visualise those labels that appear in
# more than 10 % of the cars
temp_df[temp_df >= 0.1].index

Only 4 categories are relatively common across different cars.

The remaining appear only in a few cars.

Therefore, how they affect the time to pass the test is difficult to know with certainty.

# let's create a dictionary to replace the rare labels with the
# string 'rare'
grouping_dict = {
    k: ('rare' if k not in temp_df[temp_df >= 0.1].index else k)
    for k in temp_df.index
}
grouping_dict

If the category appears in at least 10% of the cars, we keep its name; otherwise, we replace its name with ‘rare’, using the dictionary created above.
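The grouping logic can also be wrapped in a small reusable function. This is only a sketch: the function name `group_rare_labels`, its parameters and the toy column are hypothetical, and it assumes the column has no missing values (`value_counts(normalize=True)` divides by the number of non-missing rows rather than by `total_cars`).

```python
import pandas as pd


def group_rare_labels(series, threshold=0.1, rare_label='rare'):
    """Replace labels whose relative frequency is below `threshold`."""
    # fraction of observations per label
    freq = series.value_counts(normalize=True)
    frequent = freq[freq >= threshold].index
    # keep the frequent labels, replace everything else
    return series.where(series.isin(frequent), rare_label)


# Hypothetical toy column: 'a' and 'b' are common, 'c' and 'd' are rare.
toy = pd.Series(['a'] * 10 + ['b'] * 8 + ['c'] + ['d'])
grouped = group_rare_labels(toy, threshold=0.1)
print(grouped.value_counts())
```

Here `Series.where` keeps the values for which the condition is true and substitutes `rare_label` elsewhere, which avoids building an explicit dictionary.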

# now we replace the categories
data['X1_grouped'] = data['X1'].map(grouping_dict)
data[['X1', 'X1_grouped']].head(10)

cols_to_use

Let's automate the replacement of infrequent categories by the label ‘rare’ in the remaining categorical variables.

I start from 1 because I already replaced the first variable in the list:

for col in cols_to_use[1:]:
    # calculate the % of cars in each category
    temp_df = pd.Series(data[col].value_counts() / total_cars)

    # create a dictionary to replace the rare labels with the
    # string 'rare'
    grouping_dict = {
        k: ('rare' if k not in temp_df[temp_df >= 0.1].index else k)
        for k in temp_df.index
    }

    # replace the rare labels
    data[col + '_grouped'] = data[col].map(grouping_dict)

data.head()

Let’s go ahead and plot the bar plots indicating the % of cars per label and the time to pass the test, for each of the new variables:

for col in ['X1_grouped', 'X2_grouped', 'X3_grouped', 'X6_grouped']:
    # calculate the frequency of the different labels in the variable
    temp_df = pd.Series(data[col].value_counts() / total_cars).reset_index()

    # rename the columns
    temp_df.columns = [col, col + '_perc_cars']

    # merge onto the mean time to pass the test
    temp_df = temp_df.merge(
        data.groupby([col])['y'].mean().reset_index(), on=col, how='left')

    # plot
    fig, ax = plt.subplots(figsize=(8, 4))
    plt.xticks(temp_df.index, temp_df[col], rotation=0)
    ax2 = ax.twinx()
    ax.bar(
        temp_df.index,
        temp_df[col + '_perc_cars'],
        color='lightgrey',
        label=col)
    ax2.plot(
        temp_df.index,
        temp_df["y"],
        color='green',
    )
    ax.set_ylabel('percentage of cars per category')
    ax2.set_ylabel('Seconds')
    ax.legend()
    plt.show()

Here we can see, for example, that cars with the label f for variable X3 tend to spend less time in testing, and that all the infrequent labels together tend to behave like the labels c and a in terms of time to pass the test.

In the ideal scenario, we would also like to have the standard deviation or the interquartile range of the time to pass the test, to get an idea of how variable that time is within each category.
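One way to get those per-category spread measures is a grouped aggregation. The sketch below uses a hypothetical toy frame that only borrows the column names `X3_grouped` and `y`; the helper `iqr` is also an assumption of this example.

```python
import pandas as pd

# Hypothetical toy data: one grouped categorical column and the target y.
toy = pd.DataFrame({
    'X3_grouped': ['a', 'a', 'a', 'c', 'c', 'c', 'rare', 'rare', 'rare'],
    'y': [90.0, 100.0, 110.0, 95.0, 100.0, 105.0, 80.0, 100.0, 150.0],
})


def iqr(values):
    """Interquartile range: 75th minus 25th percentile."""
    return values.quantile(0.75) - values.quantile(0.25)


# mean, standard deviation and IQR of y for each label
spread = toy.groupby('X3_grouped')['y'].agg(['mean', 'std', iqr])
print(spread)
```

In this toy example the ‘rare’ group has a similar mean to the others but a much larger spread, which is the kind of detail that the mean alone hides.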

Rare labels lead to uneven distribution of categories in train and test sets

Similarly to highly cardinal variables, rare or infrequent labels often land only on the training set, or only on the testing set.

If present only in the training set, they may lead to overfitting.

If present only in the testing set, the machine learning algorithm will not know how to handle them, as it has not seen them during training.

Let’s explore this further.

# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use], data.y, test_size=0.3, random_state=0)
X_train.shape, X_test.shape

# Let's find out labels present only in the training set
# I will use X2 as example
unique_to_train_set = [
    x for x in X_train['X2'].unique() if x not in X_test['X2'].unique()
]
print(unique_to_train_set)

There are 7 categories that are present in the train set but not in the test set.

# Let's find out labels present only in the test set
unique_to_test_set = [
    x for x in X_test['X2'].unique() if x not in X_train['X2'].unique()
]
print(unique_to_test_set)

In this case, there are 2 rare values present in the test set only.
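The same comparison can be written with Python sets, which some readers may find easier to follow than the list comprehensions. The two toy series below are hypothetical stand-ins for `X_train['X2']` and `X_test['X2']`.

```python
import pandas as pd

# Hypothetical toy splits standing in for X_train['X2'] and X_test['X2'].
train_labels = pd.Series(['a', 'b', 'b', 'c', 'd'])
test_labels = pd.Series(['a', 'b', 'e'])

train_set = set(train_labels.unique())
test_set = set(test_labels.unique())

# set difference: labels on one side that never appear on the other
only_in_train = sorted(train_set - test_set)
only_in_test = sorted(test_set - train_set)

print(only_in_train)
print(only_in_test)
```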

Effect of rare labels on machine learning algorithms

In order to use these variables to build machine learning models using sklearn, we first need to replace the labels by numbers.

The correct way to do this is to first separate the data into training and test sets.

Then, create a replacement dictionary using the train set only, and use that dictionary to replace the strings in both the train and test sets.
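A minimal sketch of that train-only encoding is shown here on hypothetical toy frames (`X_train_toy` and `X_test_toy` are not part of the original notebook). A label that the train set never contained has no entry in the dictionary, so `map` returns NaN for it.

```python
import pandas as pd

# Hypothetical toy splits; 'e' appears only in the test set.
X_train_toy = pd.DataFrame({'X2': ['a', 'b', 'b', 'c']})
X_test_toy = pd.DataFrame({'X2': ['a', 'c', 'e']})

# learn the label -> integer mapping from the train set only
mapping = {k: i for i, k in enumerate(X_train_toy['X2'].unique())}

# apply the same mapping to both sets
train_encoded = X_train_toy['X2'].map(mapping)
test_encoded = X_test_toy['X2'].map(mapping)

# labels unseen during training are not in the mapping and become NaN
print(test_encoded.isnull().sum())
```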

This will introduce missing values (NaN) in the test set for those labels that are not present in the train set. We saw this effect in the previous lecture, in the section dedicated to rare values; later in the course, I will show you how to avoid this problem.

For now, in order to speed up the demonstration, I will replace the labels by numbers in the entire dataset, and then divide it into train and test sets.

But remember: THIS IS NOT GOOD PRACTICE!

# original variables
for col in cols_to_use:
    # create the dic and replace the strings in one line
    data.loc[:, col] = data.loc[:, col].map(
        {k: i for i, k in enumerate(data[col].unique(), 0)})

# variables with grouped categories
for col in ['X1_grouped', 'X6_grouped', 'X3_grouped', 'X2_grouped']:
    # create the dic and replace the strings in one line
    data.loc[:, col] = data.loc[:, col].map(
        {k: i for i, k in enumerate(data[col].unique(), 0)})

data.head(10)

Let’s remind ourselves of the original columns:

cols_to_use

# let's add the grouped variables to a list
cols_grouped = ['X1_grouped', 'X6_grouped', 'X3_grouped', 'X2_grouped']
cols_grouped

# let's combine the list of variables
cols = cols_to_use + cols_grouped

# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    data[cols], data.y, test_size=0.3, random_state=0)
X_train.shape, X_test.shape

Next, I will build a series of machine learning models, first using the original categorical variables and then using those where the infrequent labels were grouped, and compare their performance.

Random Forests

# model built on data with infrequent categories (original)

# call the model
rf = RandomForestRegressor(n_estimators=300, max_depth=34, random_state=39)

# train the model
rf.fit(X_train[cols_to_use], y_train)

# make and print predictions in train and test sets
print('Train set')
pred = rf.predict(X_train[cols_to_use])
print('Random Forests mse: {}'.format(mean_squared_error(y_train, pred)))
print('Random Forests r2: {}'.format(r2_score(y_train, pred)))

print('Test set')
pred = rf.predict(X_test[cols_to_use])
print('Random Forests mse: {}'.format(mean_squared_error(y_test, pred)))
print('Random Forests r2: {}'.format(r2_score(y_test, pred)))

We can see from the mean squared error and the r2 that the Random Forests are over-fitting to the train set.

The mse for the train set is less than half the value of the mse of the test set.

The r2 in the test set is significantly lower than the one in the train set.
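Since the same train/test report is printed for every model in this post, it could be factored into a helper. The sketch below is an assumption of this example, not part of the original notebook: the name `report` and the synthetic data are hypothetical, and a linear model is used only to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def report(model, X_train, y_train, X_test, y_test):
    """Return mse and r2 on the train and test sets for a fitted model."""
    results = {}
    for name, X, y in [('train', X_train, y_train), ('test', X_test, y_test)]:
        pred = model.predict(X)
        results[name] = {
            'mse': mean_squared_error(y, pred),
            'r2': r2_score(y, pred),
        }
    return results


# Hypothetical synthetic data: y depends linearly on the first feature.
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X[:70], y[:70])
results = report(model, X[:70], y[:70], X[70:], y[70:])
print(results)
```

A large gap between the train and test entries of the returned dictionary is the over-fitting signal discussed above.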

# model built on data with rare values grouped into one category: rare

# call the model
rf = RandomForestRegressor(n_estimators=300, max_depth=4, random_state=39)

# train the model
rf.fit(X_train[cols_grouped], y_train)

# make and print predictions
print('Train set')
pred = rf.predict(X_train[cols_grouped])
print('Random Forests mse: {}'.format(mean_squared_error(y_train, pred)))
print('Random Forests r2: {}'.format(r2_score(y_train, pred)))

print('Test set')
pred = rf.predict(X_test[cols_grouped])
print('Random Forests mse: {}'.format(mean_squared_error(y_test, pred)))
print('Random Forests r2: {}'.format(r2_score(y_test, pred)))

We can see an improvement in Random Forests: when we train the model using all the labels, the model strongly over-fits to the training set.

However, when we train the model using fewer categories, the Random Forests over-fit much less (mse 73 vs mse 136; r2 0.12 vs 0.52).

In addition, the second model still keeps a similar generalisation power to unseen data.

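Disclaimer: the first forest uses a large max_depth (34) in order to make it overfit, so that I can show you the effect of rare labels on this algorithm. Note that the second forest above is built with max_depth=4, so part of the difference between the two models may come from tree depth rather than from grouping the labels.

We can indeed improve the fit and generalisation by building fewer and shallower trees.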

AdaBoost

# model built on data with all the categories
ada = AdaBoostRegressor(n_estimators=400, random_state=44)
ada.fit(X_train[cols_to_use], y_train)

print('Train set')
pred = ada.predict(X_train[cols_to_use])
print('AdaBoost mse: {}'.format(mean_squared_error(y_train, pred)))
print('AdaBoost r2: {}'.format(r2_score(y_train, pred)))

print('Test set')
pred = ada.predict(X_test[cols_to_use])
print('AdaBoost mse: {}'.format(mean_squared_error(y_test, pred)))
print('AdaBoost r2: {}'.format(r2_score(y_test, pred)))

# model built on data with infrequent categories grouped under one label
ada = AdaBoostRegressor(n_estimators=400, random_state=44)
ada.fit(X_train[cols_grouped], y_train)

print('Train set')
pred = ada.predict(X_train[cols_grouped])
print('AdaBoost mse: {}'.format(mean_squared_error(y_train, pred)))
print('AdaBoost r2: {}'.format(r2_score(y_train, pred)))

print('Test set')
pred = ada.predict(X_test[cols_grouped])
print('AdaBoost mse: {}'.format(mean_squared_error(y_test, pred)))
print('AdaBoost r2: {}'.format(r2_score(y_test, pred)))

We see an improvement in AdaBoost when it is trained using the variables with fewer categories.

The mse is smaller in the latter and the r2 is higher, both for the training and testing sets.

Linear Regression

# model built on data with plenty of categories
linreg = LinearRegression()
linreg.fit(X_train[cols_to_use], y_train)

print('Train set')
pred = linreg.predict(X_train[cols_to_use])
print('Linear Regression mse: {}'.format(mean_squared_error(y_train, pred)))
print('Linear Regression r2: {}'.format(r2_score(y_train, pred)))

print('Test set')
pred = linreg.predict(X_test[cols_to_use])
print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))
print('Linear Regression r2: {}'.format(r2_score(y_test, pred)))

# model built on data with infrequent categories grouped under one label
linreg = LinearRegression()
linreg.fit(X_train[cols_grouped], y_train)

print('Train set')
pred = linreg.predict(X_train[cols_grouped])
print('Linear Regression mse: {}'.format(mean_squared_error(y_train, pred)))
print('Linear Regression r2: {}'.format(r2_score(y_train, pred)))

print('Test set')
pred = linreg.predict(X_test[cols_grouped])
print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))
print('Linear Regression r2: {}'.format(r2_score(y_test, pred)))

Here again, Linear Regression benefited from grouping rare labels: the mse has decreased in both train and test sets, and the r2 has increased.

So now you know how having fewer categories, with the infrequent ones grouped into a single category, can increase the performance of machine learning models.

DiogoRibeiro7/Medium-Blog: some Jupyter Notebooks that were published in my Medium Blog (github.com).