Udacity Data Scientist Nanodegree Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services
KuanYao Huang
May 17
Project Overview
By creating a customer segmentation and comparing it to the general population, one can learn which parts of the general population are more likely to become customers and which are not.
I analyzed demographics data for customers of a mail-order sales company in Germany, comparing it to demographics information for the general population.
First, I pre-processed the data with several approaches, and then I used unsupervised learning techniques, PCA (Principal Component Analysis) and k-means clustering, to segment the customers and to identify the core customer traits of the company.
Secondly, with demographics information for targets of a marketing campaign for the company, I used different models to predict which individuals are most likely to convert into customers.
Finally, I also tested the model in the accompanying Kaggle competition. Everything discussed here is based on the Jupyter notebook uploaded to the GitHub repository.
Problem Statement
The goal includes four parts:
- Data pre-processing: clean and re-encode the data.
- Segmentation: use unsupervised learning techniques to cluster the customers and the general population, and then identify the differences.
- Prediction: use the demographic features to predict whether or not a person became a customer after a mailout campaign.
- Kaggle submission: use the same algorithm to make predictions and submit them to the Kaggle competition for evaluation.
Metrics
In the segmentation part, the explained variance ratio is used in the PCA process. The explained variance ratio of a component is the share of the total feature variance that the component captures; the higher it is, the more important the component.
In the supervised prediction part, precision is used as the main metric.
Analysis and Methodology
Data Exploration, Visualization and Preprocessing
There are four datasets, which share almost all of their demographic features (only a few columns differ):
- Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns)
- Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns)
- Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns)
- The corresponding test file (csv): Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns)
In addition to the above data, there are two metadata files:
- DIAS Information Levels - Attributes 2017.xlsx: a top-level list of attributes and descriptions, organized by informational category
- DIAS Attributes - Values 2017.xlsx: a detailed mapping of data values for each feature in alphabetical order
I also created a file called DIAS_Attributes_Values_2017_details.csv, based on the DIAS Attributes - Values 2017.xlsx file, so that I can map each attribute to its type and its missing-value coding.
Deal with Missing/Unknown Values
During pre-processing, I re-encoded missing or unknown values as NaN so that they are all encoded consistently. For example, the value ‘-1’ means unknown in the AGER_TYP attribute, so I replaced it with NumPy's NaN value.
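A minimal sketch of this re-encoding step is shown here. The layout of the details table (an `attribute` column and a comma-separated `missing_codes` column), the `EXAMPLE_ATTR` attribute and the `encode_missing` helper are assumptions made for illustration; they are not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt of an attribute-details table: one row per attribute,
# with the codes that mean "missing or unknown" as a comma-separated string.
details = pd.DataFrame({
    "attribute": ["AGER_TYP", "EXAMPLE_ATTR"],
    "missing_codes": ["-1", "-1,9"],
})

# Toy demographics data standing in for the real table.
df = pd.DataFrame({
    "AGER_TYP": [-1, 1, 2, -1],
    "EXAMPLE_ATTR": [9, 3, -1, 4],
})

def encode_missing(data, details):
    """Return a copy of `data` where each attribute's unknown codes are NaN."""
    out = data.astype(float)
    for attribute, codes in zip(details["attribute"], details["missing_codes"]):
        if attribute in out.columns:
            unknown = [float(code) for code in codes.split(",")]
            out[attribute] = out[attribute].replace(unknown, np.nan)
    return out

encoded = encode_missing(df, details)
```

Keeping the codes in a lookup table rather than in the code itself means the same function can be reused for every dataset that shares the attribute dictionary.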
Evaluate missing data in each attribute
For each attribute, I calculated the proportion of missing values. The distribution of these missing ratios shows that most attributes are below 20% (Fig. 1). Therefore, I treated the attributes above this level as outliers and deleted them.
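The column filter can be sketched as follows. The toy data and the choice of a strict "greater than 20%" comparison are assumptions; the text only states that most attributes fall below 20%.

```python
import numpy as np
import pandas as pd

# Toy data: column C is mostly missing, column B is exactly at 20%.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [1, np.nan, 3, 4, 5],
    "C": [np.nan, np.nan, np.nan, 4, 5],
})

# Share of missing values per attribute (column).
missing_ratio = df.isna().mean()

# Assumed rule: drop attributes whose missing ratio is strictly above 20%.
THRESHOLD = 0.20
columns_to_drop = missing_ratio[missing_ratio > THRESHOLD].index.tolist()
df_reduced = df.drop(columns=columns_to_drop)
```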
Fig. 1 Distribution of missing ratio for each attribute
Evaluate missing data in each row
Missing data in each row was also evaluated. Most rows have no missing values, so I dropped every row with at least one missing value (Fig. 2). However, to make sure the dropped rows do not distort the data as a whole, I checked the distributions of some features.
Fig. 2 Distribution of missing ratio in each row
I compared the distribution of values for at least five attributes between two subsets: the rows that were kept and the rows that were dropped. There is no large difference between the two (Fig. 3), so I kept this criterion for handling missing values in rows.
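One way to run this check is sketched below. The column names `FEATURE_A` and `FEATURE_B` and the toy values are hypothetical; the idea is only to split the rows by whether they contain a missing value and compare the value shares of a fully observed attribute.

```python
import numpy as np
import pandas as pd

# Toy data: FEATURE_A is fully observed, FEATURE_B has gaps.
df = pd.DataFrame({
    "FEATURE_A": [1, 2, 2, 1, 2, 1],
    "FEATURE_B": [3, np.nan, 4, 3, np.nan, 4],
})

# Split into rows without any missing value and rows with at least one.
has_missing = df.isna().any(axis=1)
kept = df[~has_missing]
dropped = df[has_missing]

# Compare the share of each FEATURE_A value in the two subsets.
comparison = pd.DataFrame({
    "kept": kept["FEATURE_A"].value_counts(normalize=True),
    "dropped": dropped["FEATURE_A"].value_counts(normalize=True),
}).fillna(0.0)
```

On real data the two columns of `comparison` would be plotted side by side for several attributes; large gaps between them would suggest that dropping the rows biases the sample.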
Fig. 3 Comparison of the distributions of the two subsets on specific attributes (see GitHub for all graphs)
Re-Encode and Select Features
There are four types of variables in the data: ordinal, numeric, categorical and mixed. For ordinal variables, although the values might not be linearly spaced, I treated them as interval variables here. The remaining categorical and mixed variables are re-encoded and selected as described below.
Regarding categorical variables, the OST_WEST_KZ feature, for instance, takes only the values ‘W’ and ‘O’, so it is binary. I re-encoded it as a numerical binary variable with values 0 and 1. On the other hand, the CAMEO_DEUG_2015 feature includes the value ‘X’, which is not described in the details document, so I re-encoded it as NaN.
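These two re-encodings might look like the sketch below. Which letter maps to 0 and which to 1 is an arbitrary assumption, and the sample values are made up.

```python
import pandas as pd

# Toy values standing in for the real columns.
df = pd.DataFrame({
    "OST_WEST_KZ": ["W", "O", "W", "O"],
    "CAMEO_DEUG_2015": ["1", "X", "8", "4"],
})

# Binary letter code to 0/1 (the direction of the mapping is an assumption).
df["OST_WEST_KZ"] = df["OST_WEST_KZ"].map({"W": 0, "O": 1})

# Blank out the undocumented 'X' code, then convert the rest to numbers.
without_x = df["CAMEO_DEUG_2015"].where(df["CAMEO_DEUG_2015"] != "X")
df["CAMEO_DEUG_2015"] = pd.to_numeric(without_x)
```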
For the remaining categorical variables, I plotted a correlation heat-map to see whether any features share common characteristics (Fig. 4).
Fig. 4 Correlation heat-map between categorical features
Based on the heat-map, LP_FAMILIE_FEIN and LP_FAMILIE_GROB are highly correlated, and so are LP_STATUS_FEIN and LP_STATUS_GROB. Since LP_FAMILIE_FEIN and LP_STATUS_FEIN are too detailed, I dropped those two. In addition, although CAMEO_DEUG_2015 and CAMEO_DEU_2015 are not plotted in the heat-map, the description file shows that they are very similar, so I dropped CAMEO_DEU_2015 to simplify. Finally, I also dropped EINGEFUEGT_AM, which is not described in the documentation.
After re-encoding and selecting the categorical variables, I used the get_dummies function in pandas to transform them into dummy variables.
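As a small illustration of the dummy encoding, the sketch below one-hot encodes a single categorical column and leaves a numeric column untouched. `NUMERIC_FEATURE` and the sample values are hypothetical.

```python
import pandas as pd

# Toy data: one categorical attribute and one numeric attribute.
df = pd.DataFrame({
    "LP_FAMILIE_GROB": [1, 2, 5, 2],
    "NUMERIC_FEATURE": [1, 3, 2, 4],
})

# Only the listed columns are expanded; each category becomes a 0/1 column.
categorical_columns = ["LP_FAMILIE_GROB"]
df_dummy = pd.get_dummies(df, columns=categorical_columns, dtype=int)
```

Passing `columns=` explicitly matters here because the categories are stored as integers, which `get_dummies` would otherwise treat as numeric and leave unchanged.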
In addition to the categorical variables, there are six mixed features:
- CAMEO_INTL_2015
- LP_LEBENSPHASE_FEIN
- LP_LEBENSPHASE_GROB
- PLZ8_BAUMAX
- PRAEGENDE_JUGENDJAHRE
- WOHNLAGE
The CAMEO_INTL_2015 attribute can be separated into two interval variables, one related to wealth and the other to life stage, so I used the map method to re-encode it. LP_LEBENSPHASE_FEIN and LP_LEBENSPHASE_GROB both describe a person's life stage, which overlaps with CAMEO_INTL_2015, so I dropped these two attributes. The PLZ8_BAUMAX and WOHNLAGE attributes can be regarded as interval variables to some extent, so I kept them. Finally, PRAEGENDE_JUGENDJAHRE can also be separated into two parts, decade and movement, so I created two functions to process this variable.
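The split of CAMEO_INTL_2015 can be sketched as below, assuming the attribute is a two-digit code whose tens digit encodes wealth and whose ones digit encodes life stage. That reading of the code, the arithmetic approach (instead of the map method mentioned above) and the names `split_cameo_intl`, `WEALTH` and `LIFE_STAGE` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def split_cameo_intl(series):
    """Split a two-digit code into (tens digit, ones digit); NaN stays NaN."""
    codes = pd.to_numeric(series, errors="coerce")
    wealth = codes // 10       # assumed: tens digit = wealth
    life_stage = codes % 10    # assumed: ones digit = life stage
    return wealth, life_stage

# Toy values standing in for the real column.
df = pd.DataFrame({"CAMEO_INTL_2015": ["51", "24", np.nan, "13"]})
df["WEALTH"], df["LIFE_STAGE"] = split_cameo_intl(df["CAMEO_INTL_2015"])
```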
After all the pre-processing and re-encoding, Udacity_AZDIAS_052018.csv is cleaned and transformed into the pandas DataFrame azdias_informative_dummy. I also created a function, clean_data, that includes all the steps mentioned above, so that the same cleaning process is applied to every demographic dataset. Finally, some features received NaNs during re-encoding, so I used Imputer to substitute them with the column means, and then used StandardScaler to scale all the features. As a check on the final output, I counted the NaNs (missing ratio) in each attribute.
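The imputation and scaling step can be sketched as follows. The `Imputer` class named in the text has been removed from recent scikit-learn releases, so this sketch uses `SimpleImputer`, its current replacement; the feature names and values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy feature table with a few gaps left over from re-encoding.
df = pd.DataFrame({
    "F1": [1.0, 2.0, np.nan, 3.0],
    "F2": [10.0, np.nan, 30.0, 20.0],
})

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(strategy="mean")
imputed = imputer.fit_transform(df)

# Scale every feature to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(imputed), columns=df.columns)

# Final check: share of missing values per attribute should now be zero.
remaining_missing = scaled.isna().mean()
```

Keeping the fitted `imputer` and `scaler` objects around makes it possible to call `transform` on the other datasets so that all of them are processed with the same statistics.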
Fig. 5 The missing ratio in each attribute (partial)
Implementation and Refinement
Customer Segmentation Report Part
Starting from almost 425 features, I conducted PCA to extract the important features of the general population. Based on the explained variance of each component, I decided to use the first 6 components, because after the 6th component the accumulated explained variance increases only slowly (Fig. 6 and Fig. 7).
Fig. 6 (Top): Explained variance across different PCA components.
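A minimal sketch of how the explained variance curves can be produced with scikit-learn is given here. The data is synthetic (three latent factors plus noise), so three components are kept in the example instead of the six chosen for the real data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled feature matrix: 10 features
# driven by 3 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

# Fit with all components to inspect the variance profile.
pca_full = PCA().fit(X)
explained = pca_full.explained_variance_ratio_
cumulative = np.cumsum(explained)

# Refit with the number of components after which the curve flattens.
n_components = 3
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X)
```

Plotting `explained` and `cumulative` against the component index gives the two kinds of curves described in the figure captions.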
Fig. 7 Accumulated explained variance for top 100 PCA components
After selecting the main PCA components, I applied KMeans clustering. To decide how many clusters to use, I examined how the KMeans score changes with the number of clusters (Fig. 8). When the number of clusters increases to about 30, the error (score) is low enough and no longer changes rapidly. Therefore, I chose 30 as the number of clusters.
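The search over the number of clusters can be sketched as below. The data is a synthetic set of three well-separated blobs, and the range of k is kept small for the example; on the real PCA output the same loop would run up to a much larger k.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the PCA-transformed data: three separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X_pca = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# KMeans.score returns the negative within-cluster sum of squared
# distances, so the sign is flipped to get an error that should shrink with k.
scores = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca)
    scores[k] = -model.score(X_pca)
```

The error always decreases as k grows, so the usual reading is to look for the point where the curve flattens rather than for a minimum.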
Fig. 8 The change of score (error) with k (number of clusters)
All the processes and methods above were also applied to the customer data, which gives cluster labels for both the general population and the customers. Plotting the distribution of the clusters shows the differences between the two datasets (Fig. 9).
Fig. 9 The distribution of the clusters across the two populations
In the table (Fig. 10), gp_portion is the proportion of each cluster in the general population and cs_portion is the proportion of each cluster in the customer population. Customers are most over-represented relative to the general population in clusters 21 and 28, and most under-represented in clusters 6 and 20.
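The comparison table can be built along these lines. The label arrays are made-up stand-ins for the KMeans output, and the `difference` column is an assumed way to rank the clusters; only the names gp_portion and cs_portion come from the text.

```python
import numpy as np
import pandas as pd

# Toy cluster labels for the two populations.
general_labels = np.array([0, 0, 1, 1, 2, 2, 2, 2])
customer_labels = np.array([0, 1, 1, 1])
n_clusters = 3

def cluster_share(labels, n_clusters):
    """Proportion of rows per cluster, with 0.0 for empty clusters."""
    share = pd.Series(labels).value_counts(normalize=True)
    return share.reindex(range(n_clusters), fill_value=0.0)

comparison = pd.DataFrame({
    "gp_portion": cluster_share(general_labels, n_clusters),
    "cs_portion": cluster_share(customer_labels, n_clusters),
})

# Positive values: cluster is over-represented among customers.
comparison["difference"] = comparison["cs_portion"] - comparison["gp_portion"]
comparison = comparison.sort_values("difference", ascending=False)
```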
Fig. 10 The difference of cluster distribution between the two datasets
Supervised Learning Model Part
In the supervised learning part, the data shares the same demographic features as the general population, so all the pre-processing is the same as in the customer segmentation part. However, to ensure that the cleaned columns are the same and that all rows are preserved, the clean_data function is revised into clean_data_preserve_row. After pre-processing, I used the same algorithms as in the segmentation part to obtain the PCA values and cluster labels. The independent variables are the original cleaned features, the 6 PCA components and the cluster label. The dependent variable is the RESPONSE column in the dataset. To predict the RESPONSE value, which is binary (0 or 1), I used three models: logistic regression, random forest and k-NN. GridSearch was also conducted to find the best parameters.
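A grid search for one of the three models might be set up as in the sketch below. The synthetic data, the parameter grid and the 3-fold cross-validation are assumptions; precision is used for scoring because it is the main metric named earlier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the feature matrix and the RESPONSE column.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Assumed grid: only the regularisation strength is tuned here.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="precision",
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

The same pattern applies to the random forest and k-NN models by swapping the estimator and the grid.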
Results
Interpretation, Model Evaluation and Validation
Customer Segmentation Report Part
First, I discuss the 6 PCA components, and then I use the clustering results above to understand the difference between the general population and the customers. For each PCA component, I show the top features in detail to get to know its characteristics (Fig. 11 and Fig. 12). The first column shows the importance of each feature.
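One way to list the top features of a component is to sort its weights by absolute value, as sketched here. The helper `top_features`, the feature names and the synthetic data are hypothetical, and the notebook may compute the importance column differently.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def top_features(pca, feature_names, component, n=3):
    """Return the n largest weights (by absolute value) of one component."""
    weights = pd.Series(pca.components_[component], index=feature_names)
    order = weights.abs().sort_values(ascending=False).index
    return weights.reindex(order).head(n)

# Synthetic data in which FEATURE_B strongly follows FEATURE_A.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] = 3 * X[:, 0] + 0.1 * rng.normal(size=100)
feature_names = ["FEATURE_A", "FEATURE_B", "FEATURE_C", "FEATURE_D"]

pca = PCA(n_components=2).fit(X)
top = top_features(pca, feature_names, component=0)
```

The sign of each weight shows whether the feature rises or falls with the component, which is what the component names in the summary are based on.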
Fig. 11 The details of PCA components 1 to 3
Fig. 12 The details of PCA components 4 to 6
Here is a summary based on the details of each component. For convenience, I name each component and give the reasons for the name after it.
- PCA component 1: Rich People Index. Low mobility (moving patterns), a high share of 1–2 family houses in the PLZ8, a high share of cars per household, many buildings in the microcell and high financial interest. These might be people who care about financial news or products, live in a busy city area and tend to live alone or with a partner only.
- PCA component 2: Young and Wandering Index. Born mainly in the 90s with digital media, not a money saver, high online affinity, not a financial investor and not traditionally minded.
- PCA component 3: Car Brand Mania Index. A high share of BMW and Mercedes Benz, of upper-class cars or sports cars, and of top German manufacturers (Mercedes, BMW).
- PCA component 4: Online Shopper Index. High transaction activity, high MAIL-ORDER transaction activity, a high density of inhabitants per square kilometer and a large number of 6–10 family houses.
- PCA component 5: Car Life Index. A high share of cars with an engine power between 61 and 120 kW, a high share of cars with no previous owner, a large number of cars with 5 seats, a high share of Asian manufacturers and a high share of newly built cars.
- PCA component 6: Anti-society Index. Low dreamy affinity, not family-minded, not culturally or socially minded, and not religious.
In addition to the PCA components, I also compared the general population with the customers. As mentioned in the previous section, Fig. 9 and Fig. 10 give the following results:
- Parts of the general population that are more likely to be among the mail-order company’s main customers: clusters 21 and 28
- Parts of the general population that are less likely to be among the mail-order company’s main customers: clusters 6 and 20
Looking into each cluster, these are its characteristics (taking an absolute value of 3 as the criterion, for convenience):
- cluster 21: PCA component 2 (Young and Wandering Index) is negatively related.
- cluster 28: PCA component 4 (Online Shopper Index) is positively related.
- cluster 6: PCA components 1 (Rich People Index) and 3 (Car Brand Mania Index) are negatively related, while PCA component 5 (Car Life Index) is positively related.
- cluster 20: PCA component 1 (Rich People Index) is negatively related, while PCA component 6 (Anti-society Index) is positively related.
Combining all the information above, here is the summary.
- The population that tends to be customers: people who are likely adults and usually shop online.
- The population that might not be customers: people who are not rich and do not focus on car brands, but tend to use cars often. Some of them might have some anti-society characteristics.
To sum up, the campaign should target people who are not young and who shop online often.
On the other hand, for people who are not rich, who use cars a lot without caring about car brands, or who have anti-society traits, there might be no need to send the campaign to them.
Supervised Learning Model Part
The data was separated into training data and test data. However, the results were not good: none of the three models (logistic regression, random forest and k-NN) achieved good precision. I discuss this in the Justification part below.
Justification
Supervised Learning Model Part
Because the original data is unbalanced (most RESPONSE values are 0, and the value 1 accounts for less than 1%), I tried the following to address the issue:
- Use the GridSearch method.
- Set the ‘class_weight’ parameter.
- Set the stratify parameter in the train_test_split process.
- Increase the number of PCA components from 6 up to 20.
However, none of them increased the precision. I return to this in the Improvement section at the end.
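Two of the measures listed above, the stratified split and the class weights, can be combined as in this sketch. The synthetic data uses 5% positives rather than the sub-1% rate of the real data, and the random forest settings are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic unbalanced data: 50 positives out of 1000 rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:50] = 1
X[:50] += 1.0  # shift the positives so there is some signal to learn

# stratify=y keeps the positive rate the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" weights classes inversely to their frequency.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)

# zero_division=0 avoids a warning if no positive is predicted at all.
precision = precision_score(y_test, model.predict(X_test), zero_division=0)
```

With very few positives, a model can easily predict no positives at all, in which case precision is undefined; that is one reason the metric can look poor on data like this.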
I also uploaded the predictions for the test data to Kaggle. As mentioned, the performance of the model is not good, so the score is also bad (Fig. 13).
Fig. 13 The submission to Kaggle
Conclusion
Reflection
In the segmentation part, I performed data pre-processing and used PCA combined with KMeans to obtain the clusters of the different populations. I discussed the differences and learned which parts of the population might be potential customers and which might not. By understanding these differences, a company can focus much more on its targets, and thereby increase its conversion rate or lower its marketing cost, which can have a large impact.
In the supervised learning part, however, the performance of the models was not good. This might result from the unbalanced data distribution, where most values of the dependent variable are 0. Further investigation is needed.
Improvement
To improve the performance of the supervised learning model, the following could be tried:
- Methods that deal with unbalanced data: under-sampling or over-sampling.
- Increase the number of PCA components.
- Use only the PCA components as independent variables.
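As an illustration of the first item, random over-sampling of the minority class can be sketched with `sklearn.utils.resample`. The data is synthetic, and this is only one of several possible resampling schemes; it has not been tried on the project data.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic unbalanced training data: 5 positives, 95 negatives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 5 + [0] * 95)

X_minority = X[y == 1]
X_majority = X[y == 0]

# Draw minority rows with replacement until both classes are the same size.
X_minority_up = resample(
    X_minority, replace=True, n_samples=len(X_majority), random_state=0
)

X_balanced = np.vstack([X_majority, X_minority_up])
y_balanced = np.concatenate([
    np.zeros(len(X_majority), dtype=int),
    np.ones(len(X_minority_up), dtype=int),
])
```

Resampling should be applied to the training split only, so that the test split keeps the original class balance and the reported precision stays meaningful.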