Logistic Regression Model Tuning with scikit-learn — Part 1

Comparison of metrics along the model tuning process

Finn Qiao, Jan 8

Classifiers are a core component of machine learning models and can be applied widely across a variety of disciplines and problem statements.

With all the packages available out there, running a logistic regression in Python is as easy as running a few lines of code and getting the accuracy of predictions on a test set.

What are some ways to improve on such a base model and how do the results compare?

For the purpose of showing some techniques, we will run some models with the Bank Marketing dataset from the UCI Machine Learning Repository.

This dataset represents the direct marketing campaigns of a Portuguese bank and whether the efforts led to a bank term deposit.

Base Logistic Regression Model

After importing the necessary packages for basic EDA and checking for missing values with the missingno package, it seems that most data is present for this dataset.

To run a logistic regression on this data, we would have to convert all non-numeric features into numeric ones.

There are two popular ways to do this: label encoding and one hot encoding.

For label encoding, a different number is assigned to each unique value in the feature column.

A potential issue with this method would be the assumption that the label sizes represent ordinality (i.e. a label of 3 is greater than a label of 1).

For one hot encoding, a new feature column is created for each unique value in the feature column.

The value would be 1 if the value was present for that observation and 0 otherwise.

However, this method can easily lead to an explosion in the number of features and, with it, the curse of dimensionality.
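The two encodings described above can be sketched on a toy column. This is a minimal illustration with pandas, not the article's original code; the column name and its values are assumptions chosen for the example.

```python
import pandas as pd

# Toy categorical column (values are illustrative assumptions).
df = pd.DataFrame({"job": ["admin.", "technician", "services", "admin."]})

# Label encoding: one integer per unique value, assigned here in
# alphabetical order, which is what implies a (possibly meaningless) ranking.
df["job_label"] = df["job"].astype("category").cat.codes

# One hot encoding: one indicator column per unique value.
dummies = pd.get_dummies(df["job"], prefix="job")

print(df)
print(dummies)
```

With three unique values the one hot version already needs three columns where label encoding needs one, which is the feature growth the text warns about.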

Below, we try to fit one model with only dummy variables and another with only label encoded variables.

Model accuracy is 0.8963340616654528
Model accuracy is 0.9053168244719592

While the resulting model accuracies are quite comparable, a quick look at AUC, another model metric, indicates a drastic improvement when label encoding was used.

It seems that label encoding performs much better across the spectrum of different threshold values.
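To see how two models can share an accuracy yet differ on AUC, here is a small hand-made sketch using scikit-learn's metrics. The labels and scores are invented for illustration and are not taken from the Bank Marketing data.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Invented ground truth and predicted probabilities for two hypothetical models.
y_true = [0, 0, 0, 0, 1, 1]
scores_a = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]
scores_b = [0.1, 0.45, 0.48, 0.6, 0.3, 0.9]

def evaluate(scores, threshold=0.5):
    # Accuracy depends on one fixed threshold; AUC depends only on how
    # well positives are ranked above negatives.
    preds = [int(s >= threshold) for s in scores]
    return accuracy_score(y_true, preds), roc_auc_score(y_true, scores)

acc_a, auc_a = evaluate(scores_a)
acc_b, auc_b = evaluate(scores_b)
print(f"model A: accuracy={acc_a:.3f}, AUC={auc_a:.3f}")
print(f"model B: accuracy={acc_b:.3f}, AUC={auc_b:.3f}")
```

Both hypothetical models make the same predictions at a threshold of 0.5, but model A ranks the positives above more of the negatives, so its AUC is higher.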

However, there are a few features in which the label ordering did not make sense.

For example, days of week:

{'fri': 1, 'mon': 2, 'thu': 3, 'tue': 4, 'wed': 5}

Furthermore, the ‘job’ feature in particular would be more explanatory if converted to dummy variables, as one’s job would appear to be an important determinant of whether they open a term deposit and an ordinal scale wouldn’t quite make sense.

Below, custom orders were determined for education, month, and day of the week while dummy variables were created for jobs.
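A mixed encoding of this kind might look like the following sketch. The column names and the exact calendar ordering are assumptions for illustration; the original mapping code is not shown in this text.

```python
import pandas as pd

# Toy frame; column names are assumed, not confirmed by the article.
df = pd.DataFrame({
    "day_of_week": ["mon", "fri", "wed"],
    "job": ["admin.", "services", "admin."],
})

# Custom order that follows the calendar instead of the alphabet.
day_order = {"mon": 1, "tue": 2, "wed": 3, "thu": 4, "fri": 5}
df["day_of_week"] = df["day_of_week"].map(day_order)

# 'job' has no natural order, so it becomes dummy variables instead.
df = pd.get_dummies(df, columns=["job"], prefix="job")

print(df)
```

The same `map` approach would apply to education and month, each with its own hand-written ordering dictionary.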

Model accuracy is 0.9053168244719592

The resulting model has a comparable accuracy to the one with only label encoded variables while maintaining a similarly high AUC score of 0.92.

More importantly, the new mix of labels and dummy variables can now be clearly explained and identified.

Normalization and Resampling

The above base model was fit on the original data without any normalization.

Here, we adopt the MinMaxScaler and constrain the range of values to be between 0 and 1.
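As a rough sketch of what this scaling does, the snippet below applies MinMaxScaler to a tiny made-up array. Fitting the scaler on the training split only is a common precaution against leakage and an assumption here; the article does not say how its scaler was fit.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up numeric features on very different scales.
X_train = np.array([[20.0, 100.0], [40.0, 300.0], [60.0, 500.0]])
X_test = np.array([[30.0, 200.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)

# Learn each column's min and max from the training data, then rescale.
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training min and max for the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)
```

Each value is mapped to (x - min) / (max - min) per column, so test values outside the training range can fall slightly below 0 or above 1.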

Model accuracy is 0.906409322651129

One might also be skeptical of the immediate AUC score of around 0.9.

Upon examining the sample of the response variable, there appears to be a class imbalance problem where only around 10% of the customers subscribed to the term deposit.
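A short worked example shows why a roughly 10:90 split makes accuracy easy to misread. The counts below are hypothetical round numbers, not the dataset's actual class counts.

```python
# Hypothetical sample with a 10:90 class split.
n_total = 1000
n_pos = 100
y_true = [1] * n_pos + [0] * (n_total - n_pos)

# A trivial "model" that always predicts the majority class.
y_pred = [0] * n_total

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_pos = true_pos / n_pos

print(f"accuracy={accuracy:.2f}, minority recall={recall_pos:.2f}")
```

Under these assumptions the trivial model reaches 90% accuracy while finding none of the subscribers, so accuracies around 0.90 need to be read alongside minority-class metrics.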

There is debate around the severity of the class imbalance issue with a 10:90 split as there are many conversion experiments out there that could have up to a 1:99 split.

Nonetheless, we explore a resampling technique here using SMOTE.

In our particular scenario, we oversample the minority class by synthetically generating additional samples.
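The core idea of generating synthetic minority samples can be sketched with NumPy alone. This is a simplified illustration of SMOTE-style interpolation, not the imbalanced-learn implementation and not necessarily what was run for this article; the function name `smote_like` and its parameters are hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))          # pick a minority sample
        nb = neighbours[base, rng.integers(k)]   # pick one of its k nearest neighbours
        gap = rng.random()                       # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Made-up minority-class points.
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])
new_points = smote_like(X_min, n_new=6)
print(new_points)
```

In practice a library implementation such as the `SMOTE` class in imbalanced-learn would normally be used, and resampling would be applied to the training split only so that the test set keeps its original class balance.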

Model accuracy is 0.8661082787084243

While the resampled data slightly outperformed on AUC, the accuracy drops to 86.6%.

This is in fact even lower than our base model.

Random Forest Model

While we have been using the basic logistic regression model in the above test cases, another popular approach to classification is the random forest model.

Let's repeat the above two models, on normalized data and on resampled data, with the random forest model.
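A side-by-side comparison of this kind could be set up roughly as follows. The data here is synthetic (generated with `make_classification` at an assumed 90:10 class split), so the printed numbers will not match the article's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = (
        accuracy_score(y_test, model.predict(X_test)),
        roc_auc_score(y_test, proba),
    )
    print(f"{name}: accuracy={results[name][0]:.3f}, AUC={results[name][1]:.3f}")
```

Scoring both models on the same held-out split keeps the accuracy and AUC figures directly comparable.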

Normalized Model accuracy is 0.9059237679048313
Resampled Model accuracy is 0.9047098810390871

Both have comparable accuracy scores, but it is interesting to note how the model accuracy on the resampled data greatly improved with the random forest model as opposed to the base logistic regression model.

While both AUC scores were slightly lower than those of the logistic models, it seems that using a random forest model on resampled data performed better on aggregate across accuracy and AUC metrics.

Grid Search

It is notable that the above models were run with the default parameters of the LogisticRegression and RandomForestClassifier classes.

Could we improve the model by tuning the hyperparameters of the model?

To achieve this, we define a “grid” of parameters that we would want to test out in the model and select the best model using GridSearchCV.

For this grid search, we use a parameter grid that consists of two dictionaries.

The first dictionary includes all variations of LogisticRegression I want to run, varying the type of regularization, the size of the penalty, and the type of solver used.

The second dictionary includes all variations of RandomForestClassifier and includes different ranges for the number of estimators (trees) and the maximum number of features used in the model.
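A two-dictionary grid of this shape can be sketched with a one-step Pipeline whose estimator is itself a grid parameter. This is a scaled-down illustration on synthetic data: the step name `classifier`, the parameter values, and the scoring choice are all assumptions, and the regularization type is left at its default to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Placeholder step; the grid below swaps in each estimator.
pipe = Pipeline([("classifier", LogisticRegression())])

param_grid = [
    {   # Logistic regression variations: penalty size and solver.
        "classifier": [LogisticRegression(max_iter=1000)],
        "classifier__C": [0.1, 1.0, 10.0],
        "classifier__solver": ["liblinear", "lbfgs"],
    },
    {   # Random forest variations: number of trees and features per split.
        "classifier": [RandomForestClassifier(random_state=0)],
        "classifier__n_estimators": [10, 50],
        "classifier__max_features": [2, 4],
    },
]

search = GridSearchCV(pipe, param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated AUC: {search.best_score_:.3f}")
```

Each dictionary is expanded separately, so logistic regression settings are never combined with random forest settings.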

Fitting 5 folds for each of 100 candidates, totalling 500 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 4.7s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 43.1s
[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 6.4min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed: 8.8min finished

With the defined parameter ranges, 100 potential models were evaluated.

As I had chosen a 5-fold cross validation, that resulted in 500 different models being fitted.

This took around 9 minutes.
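The candidate and fit counts follow from simple arithmetic: each dictionary contributes the product of its list lengths, the dictionaries are summed, and the total is multiplied by the number of folds. The grid sizes below are hypothetical, chosen only so that they reproduce the 100 candidates reported in the log.

```python
from math import prod

# Hypothetical grid sizes (assumed, not taken from the article's code).
param_grid = [
    {   # 2 penalties x 20 C values x 1 solver = 40 candidates
        "classifier__penalty": ["l1", "l2"],
        "classifier__C": [10 ** (i / 5) for i in range(20)],
        "classifier__solver": ["liblinear"],
    },
    {   # 10 tree counts x 6 feature counts = 60 candidates
        "classifier__n_estimators": list(range(10, 101, 10)),
        "classifier__max_features": list(range(6, 32, 5)),
    },
]

n_folds = 5
n_candidates = sum(
    prod(len(values) for values in grid.values()) for grid in param_grid
)
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Because the count multiplies within each dictionary, adding even one more hyperparameter list can sharply increase the run time.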

This is what the “best” model looks like for parameters:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
    max_depth=None, max_features=6, max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
    oob_score=False, random_state=None, verbose=0, warm_start=False)

The accuracy of 0.908 and AUC score of 0.94 are both the highest we’ve seen of each respective metric from all models so far.

Model accuracy is 0.9083515416363195

Repeating the same fit on resampled data yielded the same accuracy and classification report but took up to 23 minutes to run.

Comparison of Base Model to “Final” Model

How have different classification metrics improved from our base model?

Base model classification report:

                  precision    recall  f1-score   support

               0       0.97      0.92      0.95      7691
               1       0.38      0.64      0.47       547

       micro avg       0.91      0.91      0.91      8238
       macro avg       0.67      0.78      0.71      8238
    weighted avg       0.93      0.91      0.92      8238

“Final” model classification report:

                  precision    recall  f1-score   support

               0       0.97      0.94      0.95      7537
               1       0.48      0.64      0.55       701

       micro avg       0.91      0.91      0.91      8238
       macro avg       0.72      0.79      0.75      8238
    weighted avg       0.92      0.91      0.92      8238

It appears that all models performed very well for the majority class, with precision and recall both above 0.9.

The new improved model, though, performs much better for the minority class, arguably the “more important” classification of whether a customer was going to subscribe to the term deposit: precision rose from 0.38 to 0.48 and the F1 score from 0.47 to 0.55, while recall held at 0.64.

The AUC score also improved to 0.94, which suggests that the final model also performs better across different threshold values.
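The minority-class F1 scores in the two reports can be roughly recovered from the reported precision and recall, since F1 is their harmonic mean. The inputs below are the rounded values from the reports, so the results only approximate the printed F1 figures.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Minority-class (label 1) precision and recall as printed in the reports.
base_f1 = f1(0.38, 0.64)
final_f1 = f1(0.48, 0.64)

print(f"base F1 is about {base_f1:.3f}, final F1 is about {final_f1:.3f}")
```

This gives roughly 0.48 and 0.55 against the reported 0.47 and 0.55; the small gap for the base model is expected because the inputs were already rounded to two decimals. With recall unchanged, the whole F1 gain comes from the higher precision.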

While we have managed to improve the base model, there are still many ways to tune the model including polynomial feature generation, sklearn feature selection, and tuning of more hyperparameters for grid search.

These will be the focus of Part 2!

In the meantime, thanks for reading and the code can be found here.

Feel free to connect on LinkedIn as well!