The following figure shows the number of features over the percentage of missing values and helps to define a threshold (= 0.2) for dropping features. A total of 73 features were affected.
Histogram: number of features / percentage of missing values

Step 4: Drop rows with high percentage of unknown values
The same approach as in step 3 was applied to rows with a high percentage of missing values.
The following figure shows the number of rows over the percentage of missing values and again helps to define a threshold (= 0.3) for dropping rows.
Histogram: number of rows / percentage of missing values

In addition to the approach of step 3, the rows were not simply dropped but split off and saved in a separate dataset for later processing.
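The two dropping steps (3 and 4) can be sketched with pandas. This is a minimal illustration on a toy DataFrame, using the thresholds of 0.2 (features) and 0.3 (rows) chosen above:

```python
import numpy as np
import pandas as pd

def split_by_missing(df, col_threshold=0.2, row_threshold=0.3):
    """Drop columns above col_threshold of missing values,
    then split off rows above row_threshold for separate handling."""
    col_missing = df.isna().mean()                   # fraction of NaNs per column
    df = df.loc[:, col_missing <= col_threshold]     # keep sufficiently complete columns
    row_missing = df.isna().mean(axis=1)             # fraction of NaNs per row
    kept = df[row_missing <= row_threshold]
    separated = df[row_missing > row_threshold]      # saved for later processing
    return kept, separated

# toy example
df = pd.DataFrame({
    "a": range(10),
    "b": [np.nan] * 5 + [1] * 5,        # 50% missing -> column dropped
    "c": [np.nan] + list(range(9)),     # 10% missing -> column kept
})
kept, separated = split_by_missing(df)  # row 0 is split off (50% missing)
```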
Step 5: Re-encode features
The remaining dataset contains four types of features: categorical, mixed, numeric and ordinal.
The majority of the features are numeric or ordinal and could be left without re-encoding.
An assessment of the 26 categorical features showed that four of them had already been dropped during earlier preprocessing steps. Three more (CAMEO_DEU_2015, LP_FAMILIE_FEIN, LP_STATUS_FEIN) could be dropped from the original dataset because of redundancy (both a fine and a rough version of the feature are available). Only one feature (OST_WEST_KZ) had to be one-hot encoded. All others had numeric values and were left unchanged for simplicity.
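The one-hot encoding of OST_WEST_KZ maps directly onto pandas' get_dummies. A minimal sketch, assuming the feature holds the flags 'O' (East) and 'W' (West):

```python
import pandas as pd

# OST_WEST_KZ is the only non-numeric categorical feature left;
# one-hot encoding turns it into one binary indicator column per value.
df = pd.DataFrame({"OST_WEST_KZ": ["W", "O", "W"]})
encoded = pd.get_dummies(df, columns=["OST_WEST_KZ"], prefix="OST_WEST_KZ")
# -> columns OST_WEST_KZ_O and OST_WEST_KZ_W
```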
Furthermore, the four mixed features were assessed. PRAEGENDE_JUGENDJAHRE was split into the new features DECADE and MOVEMENT, and CAMEO_INTL_2015 was split into WEALTH and LIFE_STAGE. The information contained in the two remaining mixed features (LP_LEBENSPHASE_FEIN, LP_LEBENSPHASE_GROB) was redundant to other features and not clearly structured, so both were dropped.
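The splitting of the two mixed features can be sketched as follows. CAMEO_INTL_2015 is a two-digit code (tens digit = wealth, units digit = life stage, per the data dictionary); the PRAEGENDE_JUGENDJAHRE lookup shown here is abbreviated and illustrative, the full mapping comes from the data dictionary:

```python
import pandas as pd

df = pd.DataFrame({"PRAEGENDE_JUGENDJAHRE": [1, 3, 4],
                   "CAMEO_INTL_2015": [13.0, 51.0, 24.0]})

# CAMEO_INTL_2015: split the two-digit code into its two dimensions.
df["WEALTH"] = df["CAMEO_INTL_2015"] // 10
df["LIFE_STAGE"] = df["CAMEO_INTL_2015"] % 10

# PRAEGENDE_JUGENDJAHRE combines decade and movement; map each code
# to both parts (abbreviated illustrative mapping).
decade_map = {1: 40, 2: 40, 3: 50, 4: 50}   # code -> decade
movement_map = {1: 0, 2: 1, 3: 0, 4: 1}     # 0 = mainstream, 1 = avantgarde
df["DECADE"] = df["PRAEGENDE_JUGENDJAHRE"].map(decade_map)
df["MOVEMENT"] = df["PRAEGENDE_JUGENDJAHRE"].map(movement_map)

df = df.drop(columns=["PRAEGENDE_JUGENDJAHRE", "CAMEO_INTL_2015"])
```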
Step 6: Imputing and Scaling
Finally, missing values were imputed with the median of the corresponding feature. Since the majority of the features are categorical or ordinal, the median is preferable to imputing mean values.
Afterwards the features were standardised.
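Both steps map directly onto sklearn; a minimal sketch on a toy matrix:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, np.nan],
              [2.0, 4.0],
              [np.nan, 6.0]])

imputer = SimpleImputer(strategy="median")   # median suits ordinal/categorical codes
scaler = StandardScaler()                    # zero mean, unit variance per feature
X_clean = scaler.fit_transform(imputer.fit_transform(X))
```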
ImplementationFor clustering a PCA is not necessarily a precondition, but it reduces noise and therefore clustering methods are better able to distinguish the different clusters (see reference ).
That’s the reason why the first step of the customer segmentation was a PCA with all available components.
Using a scree plot helped to identify to how many components the PCA could be reduced (around 150 features for an explained variance of 0.
The following figure shows the scree plot for a PCA with all components.
PCA: explained variance / number of features

The results of the PCA with a reduced number of components were then used as input for clustering with KMeans.
To find the right number of clusters, KMeans was applied in a loop with cluster counts from 1 to 20; afterwards, the elbow method was used for evaluation.
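The loop can be sketched as follows (random stand-in data; the 10 components and the 1 to 20 range are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))       # stands in for the preprocessed data

# Reduce to the number of components suggested by the scree plot.
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)

# Elbow method: track the average (squared) distance to the nearest
# centroid for each cluster count; the "elbow" marks a good trade-off.
avg_distances = []
for k in range(1, 21):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca)
    avg_distances.append(km.inertia_ / len(X_pca))   # mean squared distance
```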
Plotting the proportions of the cluster counts for both groups — general population and customers — is a simple method for finding out which clusters are overrepresented/underrepresented.
Overrepresented customer clusters can clearly be identified as the target group. In this specific case, the individuals with many missing values (separated during preprocessing) could also be handled as a cluster of their own.
For the feature selection process, an approach with a supervised learning model was chosen (see reference).
The predictions of the KMeans clustering process were used to fit a LogisticRegression classifier; afterwards, its coefficients were used to find the most important features for the classification.
In a last step, mean and median were calculated for each feature per cluster of interest.
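The coefficient-based feature selection can be sketched like this (random stand-in data; feature names and cluster counts are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
feature_names = [f"feat_{i}" for i in range(8)]

# Cluster labels from KMeans serve as the target for a classifier.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# The magnitude of each coefficient indicates how strongly a feature
# drives the assignment to the corresponding cluster.
clf = LogisticRegression(max_iter=1000).fit(X, labels)
cluster_of_interest = 0
importance = np.abs(clf.coef_[cluster_of_interest])
top5 = [feature_names[i] for i in np.argsort(importance)[::-1][:5]]
```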
Complications
During implementation, two main problems occurred.
Within data preprocessing, several approaches for replacing missing-value codes with NaNs were tried before arriving at the final solution. The first approach was iterating manually over each column and row: its drawbacks were source-code complexity and bad performance. The second approach was a one-liner, eliminating the source-code complexity but with even worse performance. Finally, the mask function of pandas' DataFrame solved both problems.
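A minimal sketch of the mask-based approach (the per-column unknown-value codes shown here are hypothetical; in the project they come from the data dictionary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, -1, 3], "B": [0, 2, 9]})

# Hypothetical per-column codes that mean "unknown" in the data dictionary.
unknown = {"A": [-1], "B": [0, 9]}

# Vectorised replacement: mask() sets cells to NaN wherever the boolean
# frame is True -- no manual loops over rows and columns needed.
is_unknown = pd.DataFrame({col: df[col].isin(codes)
                           for col, codes in unknown.items()})
df = df.mask(is_unknown)
```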
It also had to be decided how to perform feature selection for the target clusters. Researching which methods could be applied took considerable effort. The decision was taken in favour of a supervised learning technique because of the simplicity of its implementation.
Refinement
A couple of things had to be refined during the implementation:

The complication regarding the replacement of NaN values described in the section above had to be corrected by reworking the source code in several iterations.
Originally, the individuals with a high percentage of missing values weren't extracted; instead all missing values were imputed with the feature's median value. It was then decided to handle these individuals as a separate cluster at a later stage of the implementation.
Results
Iterating over KMeans with different numbers of clusters made it possible to apply the elbow method. A decision in favour of nine clusters was taken.
The following figure shows the average distance to centroids over the number of clusters.
Elbow method: average distance to centroids / number of clusters

Comparing the proportions of the cluster counts for the general population with those of the customers gave a clear result.
The following figure shows these proportions.
Cluster counts proportions for general population and customers

Two customer clusters were clearly overrepresented (cluster #4: 29.7% / cluster #0: 9.3%), and their individuals are therefore predestined as the customer base, or target group, for the mail-order company.
The following table shows mean and median for the five most important features of Cluster #4.
Feature Selection: Cluster #4

Especially interesting for our analysis are the differences for the features ALTERSKATEGORIE_GROB and W.
Compared to the general population, our target group is a couple of years older.
The share of customers living in the former West Germany corresponds to the share of the current population (around 80%).
Because of the high value for the general population (0.92), one might be tempted to say that if an individual lives in former West Germany, it is less likely that this individual becomes a customer.
Further investigations showed we have to be careful with that statement (refer to cluster #7).
The following table shows mean and median for the five most important features of Cluster #0.
Feature Selection: Cluster #0

The difference for the re-encoded WEALTH feature is remarkable.
Our target group is significantly wealthier than the general population.
Another cluster is slightly overrepresented (cluster #5: 13.…%), but the difference is too small to investigate further.
All others (clusters #1/2/3/6/7/8) are underrepresented.
None of these clusters was analysed further, because they are clearly out of scope regarding customer acquisition.
One exception was made for cluster #7, the cluster with the highest gap (1.7%), to find out what characteristics mark individuals who are, in a sense, the opposite of our target group.
The following table shows mean and median for the five most important features of Cluster #7.

Feature Selection: Cluster #7

The differences for all features are worth mentioning. Our “anti-customer” is less dutiful/traditionally minded (higher value for SEMIO_PFLICHT), grew up over a decade later (mean for feature DECADE: late 70s vs. early 60s) and is more cautious regarding financial investments (higher value for FINANZ_ANLEGER), which might be related to the younger age.
It seems that these individuals live predominantly in East Germany, but as mentioned in the analysis of Cluster #4, we have to be careful with that statement.
In that case the numbers tell us the exact opposite of what we expected.
The last cluster (cluster #9: 26.9%) was artificially added to the plot.
It contains the individuals with many missing values that were extracted during data preprocessing.
Therefore the cluster was not created during the clustering process.
A further analysis to find the reasons for the high percentage of missing values could be worth the effort but was not considered within the project.
Part II: Supervised Learning Model

Analysis
As in Part I, an analysis of the provided data was done first. Two main datasets were provided by Arvato Financial Solutions as comma-separated values files:

MAILOUT_TRAIN (May 2018): demographic data for individuals who were targets of a marketing campaign; 42 982 persons, 367 features
MAILOUT_TEST (May 2018): demographic data for individuals who were targets of a marketing campaign; 42 833 persons, 366 features

Originally both datasets were coherent, but the data has been split into two approximately equal parts.
The MAILOUT_TRAIN partition includes a column RESPONSE, which states whether or not a person became a customer of the company following the campaign.
The MAILOUT_TEST partition doesn’t contain that column.
All other features match the features of the AZDIAS dataset.
In around 99% of the cases the individuals didn't respond to the mailout, i.e. the training dataset is extremely imbalanced.
Data Preprocessing
Because the features of the MAILOUT_TRAIN and MAILOUT_TEST datasets match those of AZDIAS, the same data preprocessing steps can be applied as described in Part I.
Therefore the implemented cleaning function could be used (steps 1 to 5) as well as imputing/scaling (step 6).
Nevertheless, the following deviations are worth mentioning:

The RESPONSE column had to be extracted from MAILOUT_TRAIN before cleaning the data (it is used for model training in the following steps).
The features dropped because of a high percentage of missing values (step 3) differ from the AZDIAS/CUSTOMER processing: even though the threshold (= 0.25) remained almost the same, only one additional feature (KK_KUNDENTYP) was dropped.
Dropping rows because of a high percentage of missing values (step 4) was not applied: in Part I this was convenient to get an additional group for the customer segmentation, but here it would have had a negative impact on our supervised learning model because of the dataset imbalance (it would drop rows with a positive RESPONSE value).
Implementation
The most important class for implementing the supervised learning model, which predicts which individuals are most likely to convert into customers, was sklearn's GridSearchCV. Besides the classifier itself, it takes the following important parameters:

param_grid: parameter names and values for tuning the classifier
scoring: evaluation method
cv: determines the cross-validation splitting strategy

The first step towards a model with adequate classification performance was testing several classification algorithms in their basic form, i.e. with default parameters.
To that end, the algorithms were fitted with MAILOUT_TRAIN and the extracted RESPONSE column.
The most promising classifier then was taken for subsequent use.
It was tuned via GridSearchCV’s param_grid parameter.
With the classifier it was possible to directly analyse the most important features (feature selection).
The resulting model was finally used to predict the probabilities for the MAILOUT_TEST dataset.
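The GridSearchCV setup can be sketched as follows (synthetic stand-in data; the grid values are illustrative, not the final tuned parameters):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for the cleaned MAILOUT_TRAIN features and RESPONSE column
# (weights=[0.9] produces an imbalanced target, like the real data).
X, y = make_classification(n_samples=400, n_features=10, weights=[0.9],
                           random_state=0)

param_grid = {                  # illustrative values for tuning
    "max_depth": [2, 3],
    "n_estimators": [80, 100],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid=param_grid,
                      scoring="roc_auc",   # metric suited to imbalanced data
                      cv=3)                # stratified k-fold for classifiers
search.fit(X, y)
best = search.best_estimator_   # tuned model for predicting probabilities
```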
Complications
The most time-consuming complication was based on a misinterpretation within the data preprocessing.
It was erroneously assumed that the same features that had a high percentage of missing values in the AZDIAS dataset had to be dropped for MAILOUT_TRAIN/MAILOUT_TEST. This resulted in very bad performance for all classifiers used (at least 20% worse). A (wrong) reason for that was found immediately: the dataset imbalance. Resampling and/or changing class weights as described in various blog posts (see references) seemed to be the answer, but unfortunately wasn't.
All attempts with the library imblearn (short for imbalanced-learn), SMOTE (Synthetic Minority Over-sampling Technique) and class weights (which can be set via a parameter for various classifiers: class_weight='balanced') failed.
Refinement
A couple of things had to be refined during the implementation:

The complication described in the section above had to be corrected by handing over a list of other features to be dropped by the cleaning function.
Originally the cleaning function wasn’t designed to skip dropping the rows with a high percentage of missing values.
But this was necessary in Part II of the project, so a redesign had to be implemented.
Cross-validation and scoring were implemented manually at first.
After introducing GridSearchCV for model tuning, both could simply be replaced by GridSearchCV's parameters.
Results
Accuracy is not an appropriate performance metric for imbalanced datasets. Instead, ROC AUC was used to evaluate performance (see reference). Additionally, cross-validation was applied automatically by GridSearchCV, which increases robustness without reducing the training data. It gets even better: for classifiers, GridSearchCV uses StratifiedKFold, which keeps the class proportions of the output column, another advantage when working with imbalanced data.
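The effect of stratification is easy to verify; a minimal sketch with a 90/10 class split:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)   # 10% positive class, like the mailout data
X = np.zeros((100, 1))              # dummy features, irrelevant for splitting

skf = StratifiedKFold(n_splits=5)
# Each test fold of 20 samples keeps exactly 2 positives (the 90/10 ratio).
fold_positive_counts = [int(y[test_idx].sum())
                        for _, test_idx in skf.split(X, y)]
```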
The following table shows the base performance of some of the classifiers used:

Classifier base performance

The GradientBoostingClassifier (as the most promising one) was then tuned with different parameters. A certain parameter set (loss: exponential / default: deviance; max_depth: 2 / default: 3; n_estimators: 80 / default: 100) led to an improved performance of 0.770805086873 for MAILOUT_TRAIN and achieved a final ROC AUC score of 0.79627 for MAILOUT_TEST.
An explanation for the good performance of Gradient Boosting is a built-in approach that combats class imbalance: it constructs successive training sets based on incorrectly classified examples (see reference).
The feature selection showed that D19_SOZIALES (transactional activity based on the product group), at 18.5%, is by far the most influential feature when fitting the classifier. This can be seen as a side note and has no consequences whatsoever.
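Such a ranking can be read directly from the fitted classifier's feature_importances_ attribute (sketch on synthetic data; the feature names are placeholders for the real column names such as D19_SOZIALES):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           random_state=0)
feature_names = [f"feat_{i}" for i in range(6)]   # placeholders

clf = GradientBoostingClassifier(random_state=0).fit(X, y)

# feature_importances_ sums to 1; sort descending to rank features.
ranking = sorted(zip(feature_names, clf.feature_importances_),
                 key=lambda t: t[1], reverse=True)
```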
Conclusion
It was necessary to apply the majority of CRISP-DM (CRoss-Industry Standard Process for Data Mining) to both parts of the project.
Business Understanding was slipped in as part of the problem description, but Data Understanding, Data Preparation, Modelling and Evaluation had to be developed from scratch.
The general rule that Data Preparation is the most time consuming part in the process could be verified once again.
Part I showed how unsupervised learning techniques — namely PCA and Clustering with KMeans — were applied to distinguish groups of individuals that best describe the core customer base of the mail-order company.
A supervised learning model, in form of LogisticRegression, then helped to identify the main characteristics of these individuals.
Part II showed the straightforward way of building a supervised learning model.
The base performance of various classifiers was determined.
With the help of GridSearchCV, the most promising one, the GradientBoostingClassifier, was fitted and tuned to the training dataset (using stratified cross-validation) and its performance was evaluated via ROC AUC.
A short analysis of the most important features completed the model creation before using it for predicting on the testing dataset which individuals of a marketing campaign are most likely to convert into becoming customers.
The most difficult, and hence most interesting, aspect of working on the project was the implications of an imbalanced dataset. Imbalance affects almost every step of building a supervised learning model: choosing a classifier that has strategies for handling this kind of data, fitting the classifier with stratified cross-validation, which preserves the class proportions, and finally evaluating the model with the right metric.
What's next?
Further improvements to the implementation could be made. As stated in the results of Part I, the separation of individuals with a high percentage of missing values needs to be analysed in more depth:

What features are missing?
Are the missing features connected in some way?
Why are these features missing?
Is this the best approach to handle the data?
What are the alternatives (e.g. imputing values)?

Answering these questions might help to find a better performing solution.
Furthermore, the extent of the source code for data preprocessing (including imputation and scaling) and classification is manageable. But using sklearn's Pipeline would definitely improve the quality of the source code.
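A sketch of what such a pipeline could look like (toy data; the steps mirror the preprocessing described above):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

# Bundling imputation, scaling and the classifier keeps preprocessing
# consistent between training and prediction.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", GradientBoostingClassifier(random_state=0)),
])

# toy data with missing values
X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.0, np.nan], [3.0, 1.0]] * 10)
y = np.array([0, 1, 0, 1] * 10)

pipe.fit(X, y)
proba = pipe.predict_proba(X)[:, 1]   # probability of the positive class
```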
Do you want to have a look at the source code? The project's Jupyter notebook and referenced files can be found in this GitHub repository.
References
Baptiste Rocca: Handling imbalanced datasets in machine learning, towardsdatascience.com, 28th January 2019
Zichen Wang: Practical tips for class imbalance in binary classification, towardsdatascience.com, 10th August 2018
Sandro Saitta: Combining PCA and K-means, 26th March 2007
Gyan Veda: Estimating the most important features in a k-means cluster partition, Cross Validated (stats.stackexchange.com), 15th September 2014