The answer: log scaling.

Log Transformation

#Plotting a histogram – log scale
sns.set_style('whitegrid') #Picking a background color
fig, ax = plt.subplots()
books_file['ratings_count'].hist(ax=ax, bins=10) #State how many bins you want
ax.set_yscale('log') #Rescaling to log, since large numbers can mess up the weightings for some models
ax.tick_params(labelsize=10)
ax.set_xlabel('Ratings Count')
ax.set_ylabel('Event') #How many times the specific numbers of ratings count happened

Now we’ll make rating predictions from normal values vs. log-transformed values of rating counts.

For the code below, we are adding +1, since the log of 0 is undefined, which would cause our computer to blow up (kidding, sort of).
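To see why the +1 matters, here is a small standalone sketch. The `ratings_count` array is made-up illustration data, not the Goodreads column, and it only assumes NumPy is installed.

```python
import numpy as np

# Hypothetical rating counts, including one book with zero ratings.
ratings_count = np.array([0, 9, 99, 999, 9999])

# log10(0) evaluates to -inf (NumPy normally warns about the division),
# and an infinite feature value will break most models.
with np.errstate(divide="ignore"):
    raw_log = np.log10(ratings_count)

# Shifting by 1 maps a count of 0 to log10(1) = 0 and keeps every value finite.
shifted_log = np.log10(ratings_count + 1)

print(raw_log[0])    # -inf
print(shifted_log)   # approximately 0, 1, 2, 3, 4
```

The shift barely changes large counts, so the ordering and rough spacing of the feature are preserved.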

books_file['log_ratings_count'] = np.log10(books_file['ratings_count']+1)

#Using ratings count to predict average rating. Cross Validation (CV) is normally 5 or 10.
base_model = linear_model.LinearRegression()
base_scores = cross_val_score(base_model, books_file[['ratings_count']], books_file['average_rating'], cv=10)
log_model = linear_model.LinearRegression()
log_scores = cross_val_score(log_model, books_file[['log_ratings_count']], books_file['average_rating'], cv=10)

#Display the R^2 values. STD*2 for 95% confidence level
print("R^2 of base data: %0.4f (+/- %0.4f)" % (base_scores.mean(), base_scores.std()*2))
print("R^2 of log data: %0.4f (+/- %0.4f)" % (log_scores.mean(), log_scores.std()*2))

R² of base data: -0.0024 (+/- 0.0125)
R² of log data: 0.0107 (+/- 0.0365)

Both of the models are quite terrible, which is not too surprising from only using one feature.

I find it kind of funny seeing a negative R squared from a standard statistics 101 perspective.

A negative R squared in our case means that a straight line fit on that one feature actually predicts worse on the held-out data than simply predicting the mean average rating for every book (aka the data does not follow the trend of the straight line/linear regression).
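A negative R squared is easier to believe once the formula is written out. The sketch below uses made-up ratings and a hypothetical helper named `r_squared`. It computes R² = 1 - SS_res / SS_tot by hand and compares a "predict the mean" baseline with a deliberately poor set of predictions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up average ratings for five books.
y_true = np.array([3.8, 4.0, 4.2, 3.9, 4.1])

# Baseline: always predict the mean rating. This scores exactly 0.
mean_only = np.full_like(y_true, y_true.mean())

# Predictions that miss by more than the mean does, so the score goes below 0.
bad_line = np.array([4.3, 3.6, 3.9, 4.4, 3.7])

print(r_squared(y_true, mean_only))      # 0.0
print(r_squared(y_true, bad_line) < 0)   # True
```

With cross-validation the line is fit on the training folds and scored on the held-out fold, which is how a fitted linear regression can end up below zero.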

Feature Scaling

We’ll go over three ways to feature scale: min max scaling, standardization, and L2 normalization.

#Min-max scaling
books_file['minmax'] = pp.minmax_scale(books_file[['ratings_count']])

#Standardization
books_file['standardized'] = pp.StandardScaler().fit_transform(books_file[['ratings_count']])

#L2 Normalization
books_file['l2_normalization'] = pp.normalize(books_file[['ratings_count']], axis=0) #Needs axis=0 for graphing

#Plotting histograms of scaled features
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4,1, figsize=(8, 7))
fig.tight_layout(h_pad=2.0)

#Normal rating counts
books_file['ratings_count'].hist(ax=ax1, bins=100)
ax1.tick_params(labelsize=10)
ax1.set_xlabel('Review ratings count', fontsize=10)

#Min max scaling
books_file['minmax'].hist(ax=ax2, bins=100)
ax2.tick_params(labelsize=10)
ax2.set_xlabel('Min max scaled ratings count', fontsize=10)

#Standardization
books_file['standardized'].hist(ax=ax3, bins=100)
ax3.tick_params(labelsize=10)
ax3.set_xlabel('Standardized ratings count', fontsize=10)

#L2 Normalization
books_file['l2_normalization'].hist(ax=ax4, bins=100)
ax4.tick_params(labelsize=10)
ax4.set_xlabel('Normalized ratings count', fontsize=10)

The graph above shows histograms of the original data, the min max scaled transformation, the standardized transformation, and the L2 normalization transformation.

Overall, the transformations look pretty similar, but you would want to pick one over the others depending on the other features in your dataset.

Now, we’ll make predictions from our three scaled versions of the feature.

#Using ratings count to predict average rating. Cross Validation (CV) is normally 5 or 10.
base_model = linear_model.LinearRegression()
base_scores = cross_val_score(base_model, books_file[['ratings_count']], books_file['average_rating'], cv=10)
minmax_model = linear_model.LinearRegression()
minmax_scores = cross_val_score(minmax_model, books_file[['minmax']], books_file['average_rating'], cv=10)
standardized_model = linear_model.LinearRegression()
standardized_scores = cross_val_score(standardized_model, books_file[['standardized']], books_file['average_rating'], cv=10)
l2_normalization_model = linear_model.LinearRegression()
l2_normalization_scores = cross_val_score(l2_normalization_model, books_file[['l2_normalization']], books_file['average_rating'], cv=10)

#Display R^2 values. STD*2 for 95% confidence level
print("R^2 of base data: %0.4f (+/- %0.4f)" % (base_scores.mean(), base_scores.std()*2))
print("R^2 of minmax scaled data: %0.4f (+/- %0.4f)" % (minmax_scores.mean(), minmax_scores.std()*2))
print("R^2 of standardized data: %0.4f (+/- %0.4f)" % (standardized_scores.mean(), standardized_scores.std()*2))
print("R^2 of L2 normalized data: %0.4f (+/- %0.4f)" % (l2_normalization_scores.mean(), l2_normalization_scores.std()*2))

R² of base data: -0.0024 (+/- 0.0125)
R² of minmax scaled data: 0.0244 (+/- 0.0298)
R² of standardized data: 0.0244 (+/- 0.0298)
R² of L2 normalized data: 0.0244 (+/- 0.0298)

A slight improvement over the log transformation.

Since all three scaling types are linear rescalings of the same column and produced the same shape graphically, it is no surprise they gave the same values for R squared.
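The matching R squared values can be checked on synthetic data. The sketch below is not the article's cross-validation setup: it assumes only NumPy, generates made-up skewed counts, fits a one-feature least-squares line with `np.polyfit` on a training slice, and scores it on a held-out slice using a hypothetical helper named `holdout_r2`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up, right-skewed "rating counts" and a noisy target.
x = rng.lognormal(mean=5.0, sigma=2.0, size=500)
y = 4.0 + 0.05 * np.log10(x + 1) + rng.normal(0.0, 0.3, size=500)

def holdout_r2(feature, target, n_train=400):
    """Fit a straight line on the first n_train rows, score R^2 on the rest."""
    slope, intercept = np.polyfit(feature[:n_train], target[:n_train], deg=1)
    pred = slope * feature[n_train:] + intercept
    held_out = target[n_train:]
    ss_res = np.sum((held_out - pred) ** 2)
    ss_tot = np.sum((held_out - held_out.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# The three scalings written out by hand, each applied to the whole column.
scaled = {
    "minmax": (x - x.min()) / (x.max() - x.min()),
    "standardized": (x - x.mean()) / x.std(),
    "l2": x / np.linalg.norm(x),
}

scores = {name: holdout_r2(col, y) for name, col in scaled.items()}
for name, score in scores.items():
    print(name, round(score, 6))
```

All three scores agree up to floating-point error, because a line with an intercept can absorb any shift or constant multiple of its input.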

A few things to note for each scaling method.

Min max is good for getting all feature values to lie between 0 and 1.

Standardization is good for scaling the variance of features, making the mean = 0 and the variance = 1 (the same mean and variance as a standard normal distribution, although the data itself does not become normally distributed).

L2 normalization works by dividing the feature by its Euclidean (L2) norm, so the scaled column has a length of 1.

Important note: feature scaling does not change the shape of your feature's distribution, since under the hood it only subtracts a constant and/or divides by a constant.
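The three scalers can also be written out by hand, which makes the "shape does not change" point concrete. This is a minimal sketch on made-up numbers, assuming only NumPy. The `skewness` helper is hypothetical and is used here as one simple measure of shape.

```python
import numpy as np

# Made-up, right-skewed counts.
x = np.array([1.0, 2.0, 4.0, 8.0, 100.0])

minmax = (x - x.min()) / (x.max() - x.min())   # values now span 0 to 1
standardized = (x - x.mean()) / x.std()        # mean 0, variance 1 (population std)
l2 = x / np.sqrt(np.sum(x ** 2))               # column has Euclidean length 1

def skewness(v):
    """Third standardized moment, a simple summary of distribution shape."""
    z = (v - v.mean()) / v.std()
    return np.mean(z ** 3)

for name, v in [("raw", x), ("minmax", minmax),
                ("standardized", standardized), ("l2", l2)]:
    print(name, round(skewness(v), 6))
```

Every row prints the same skewness, since each transformation is only a shift and a division by a positive constant. A log transformation, by contrast, would change it.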

Conclusion

Awesome, we covered a brief overview of the machine learning pipeline and where feature engineering fits in.

Then we went over binning, log transformation, and various forms of feature scaling.

Along the way, we also viewed how feature engineering affects linear regression model predictions with book reviews.

Personally, I find min max works well when most of the other features in a dataset are probabilities or otherwise already lie between 0 and 1.

On a similar note, standardization and L2 normalization work for scaling really big numbers down to a range comparable with the other features in the dataset being analyzed.

Disclaimer: All things stated in this article are of my own opinion and not of any employer.

[1] Kaggle, Goodreads-books (2019), https://www.kaggle.com/jealousleopard/goodreadsbooks

[2] Microsoft, What are machine learning pipelines? (2019), https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines