Customer Segmentation Report for Arvato Financial Solutions

Elena Ivanova, Dec 3

Capstone Project for Udacity Data Scientist NanoDegree

Introduction

In this project, supervised and unsupervised learning techniques are used to analyze demographic data on the customers of a mail-order sales company in Germany against demographic information for the general population. The goal is to characterize the customer segment of the population and to build a model that can predict customers for Arvato Financial Solutions.

The data for this project is provided by Udacity partners at Bertelsmann Arvato Analytics and represents a real-life data science task. It includes a general population data set, a customer segment data set, a mailout campaign data set with responses, and a test data set for which predictions need to be made.

Problem Statement

There are four main parts of the project:

1. Data Preprocessing
In this part we need to preprocess the data for further analysis. Missing values are analysed by column and by row, and the data are divided by type before the subsequent transformations.
2. ...

The AZDIAS data set has 891221 persons (rows) and 366 features (columns).

Figure: Descriptive statistics for the first few attributes of the AZDIAS data set

CUSTOMERS: Demographics data for customers of a mail-order company. It has 191652 rows and 369 features.

Figure: Descriptive statistics for the first few attributes of the CUSTOMERS data set

MAILOUT_TRAIN: Demographics data for individuals who were targets of a marketing campaign; 42982 persons and 367 features, including each person's response.

MAILOUT_TEST: Demographics data for individuals who were targets of a marketing campaign; 42833 persons and 366 features.

Unfortunately, these data sets contain many missing values, and not all of the listed features are explained in the provided Excel spreadsheets.

Analysis of columns and rows with missing values

First, I created a Python dictionary of missing-value codes, in which each key is an attribute and each
value is a list of missing-value codes parsed from DIAS Attributes - Values 2017.xlsx. Interestingly, the dictionary contains only 275 of the 366 attributes. Many features are therefore not listed in the attribute description file, and for some attributes the missing values were simply never entered and appear in the data set as NumPy not-a-number (np.nan).

Figure: First three items from the missing-value codes dictionary

Next, values in the AZDIAS data set that match a missing-value code were converted to np.nan, and the final number of missing values was analyzed for each attribute. Most columns have less than 30% missing data, while 41 attributes have more than 30% missing data (see the distribution of missing values in these columns below). These 41 attributes were dropped from the analysis.

Figure: Attributes with more than 30% of missing values, dropped from the analysis

Additional columns were dropped for the following reasons: the column of unique values, LNR; categorical columns with more than 10 categories, to avoid many additional attributes after one-hot encoding (CAMEO_INTL_2015 is an exception); columns that repeat information from another feature (e.g.
fein vs grob); and attributes for which no description was given, so that the meaning and type of the column (categorical vs ordinal vs mixed) are hard to infer.

The assessment of missing values by row shows that the maximum number of missing values in a single row is 233 of the 303 attributes left after dropping columns. The distribution of missing values per row shows that most rows have fewer than 25 missing attributes. So the data was divided into two subsets: azdias with <= 25 missing attributes (737235 rows) and azdias with > 25 missing attributes (153986 rows). A comparison of the value distributions for 6 randomly chosen columns shows that the two subsets are similarly distributed (see the bar plots for 6 attributes in the few-NaN and many-NaN data sets below).

Figure: Comparison of the distribution of values between the data set with few missing values (blue) and the data set with many missing values (orange)

Assigning types of attributes

Figure: Part of the manually created table with attribute types and actions

To proceed to the data engineering and transformation step, every attribute must be assigned one of the following types: categorical, numerical, ordinal, binary, or mixed.

The advantage of the standardization procedure is that it does not bound values to a specific range and is much less affected by outliers.

Data transformation pipeline

Figure: Distribution of skewed data for the ANZ_HAUSHALTE_AKTIV attribute (skew=8.3)

First, I identified skewed continuous numerical attributes using the pandas skew method with a skew threshold of 1.0.

In this transformation no redundant dummy columns were dropped, in order to keep the hidden information about missing values encoded by zeros.

It prevents data leakage that could result in an overfitted algorithm, improves code organization, and reduces the risk of confusing columns between the training and testing sets.

Overall, after the extract, transform, load (ETL) pipeline step, the AZDIAS data was transformed to a 737235 rows x 410 columns data set and the CUSTOMERS data was
transformed to a 134245 rows x 410 columns data set.

Customer Segmentation

PCA

Principal component analysis (PCA) was applied to the data for dimensionality reduction.

So, 16 clusters were selected as the ideal number for k-means clustering.

Unsupervised machine learning pipeline

An unsupervised learning pipeline was created consisting of the following steps: data transformation, PCA and KMeans (see below).

    from sklearn.pipeline import Pipeline
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # ct is the data transformation step defined earlier
    cluster_pipeline = Pipeline([
        ('transform', ct),
        ('pca', PCA(n_components=175)),
        ('kmeans', KMeans(n_clusters=16)),
    ])
    cluster_pipeline.fit(azdias_cleaned)
    # general_predictions is an assumed name; the original assignment target was lost
    general_predictions = cluster_pipeline.predict(azdias_cleaned)
    customers_predictions = cluster_pipeline.predict(customers_cleaned)

The fit and predict methods were applied to the AZDIAS data, and the predict method was applied to the CUSTOMERS data.

Comparison of CUSTOMERS data to AZDIAS data

The clustering results for the general population (AZDIAS) and the customers (CUSTOMERS) were compared using the proportion of people in each cluster.

Figure: Proportion of people for the general population and the customer data

Figure: Difference in proportion between customers and general audience: positive is overrepresented and negative is underrepresented

The comparison of the proportions of people, and of the difference in proportion between the general and customer audiences in each cluster (customers_ratio - general_ratio), shows that some clusters overrepresent customers while others underrepresent them.
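
The per-cluster comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper cluster_proportions and the toy label arrays are hypothetical, standing in for the k-means output on the AZDIAS and CUSTOMERS data. Only the names customers_ratio and general_ratio come from the text.

```python
import numpy as np

def cluster_proportions(labels, n_clusters):
    """Return the share of samples assigned to each cluster id 0..n_clusters-1."""
    counts = np.bincount(np.asarray(labels), minlength=n_clusters)
    return counts / counts.sum()

# Toy cluster labels (made-up values, not project results).
general_labels = np.array([0, 0, 1, 1, 2, 2, 2, 2])
customer_labels = np.array([0, 0, 0, 1])

general_ratio = cluster_proportions(general_labels, n_clusters=3)
customers_ratio = cluster_proportions(customer_labels, n_clusters=3)

# Positive: the cluster is overrepresented among customers; negative: underrepresented.
difference = customers_ratio - general_ratio
print(difference)
```

Because both ratio vectors sum to one, the differences always sum to zero, so every overrepresented cluster is balanced by underrepresentation elsewhere.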
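
Going back to the preprocessing part, the missing-value handling (codes converted to np.nan, columns with more than 30% missing values dropped, rows split at 25 missing attributes) might look roughly like the sketch below. The function names, the toy data frame and the toy code dictionary are all assumptions made for illustration; the thresholds in the toy calls are scaled down to fit three columns.

```python
import pandas as pd

def codes_to_nan(df, missing_codes):
    """Replace each attribute's missing-value codes with NaN."""
    out = df.copy()
    for col, codes in missing_codes.items():
        if col in out.columns:
            # mask() sets entries to NaN wherever the condition is True
            out[col] = out[col].mask(out[col].isin(codes))
    return out

def drop_sparse_columns(df, max_missing_share=0.30):
    """Keep only columns whose share of missing values is at most the threshold."""
    share = df.isna().mean()
    return df.loc[:, share <= max_missing_share]

def split_by_row_missing(df, max_missing=25):
    """Split rows into (few missing, many missing) by the number of NaNs per row."""
    n_missing = df.isna().sum(axis=1)
    return df[n_missing <= max_missing], df[n_missing > max_missing]

# Toy data and toy missing-value codes (hypothetical, not the real attributes).
toy = pd.DataFrame({"A": [1, -1, 3, 0], "B": [9, 9, 9, 2], "C": [5, 6, 7, 8]})
toy_codes = {"A": [-1, 0], "B": [9]}

cleaned = codes_to_nan(toy, toy_codes)
kept = drop_sparse_columns(cleaned, max_missing_share=0.5)
few, many = split_by_row_missing(kept, max_missing=0)
```

Keeping the three steps as separate functions makes it possible to apply the same rules unchanged to the CUSTOMERS and MAILOUT data.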
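
The skew check from the data transformation pipeline can be illustrated in a similar way. The text only states that the pandas skew method was used with a threshold of 1.0. Treating only positive skew as skewed and applying a log1p transform afterwards are assumptions of this sketch, and the helper name and toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def skewed_columns(df, columns, threshold=1.0):
    """Return the columns whose sample skewness exceeds the threshold."""
    skew = df[columns].skew()
    return skew[skew > threshold].index.tolist()

# Toy data: one right-skewed column and one symmetric column (made-up values).
toy = pd.DataFrame({
    "right_skewed": [1, 1, 1, 2, 2, 3, 50],
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
})

to_transform = skewed_columns(toy, ["right_skewed", "symmetric"])

# Assumed transform: log1p compresses the long right tail.
transformed = toy.copy()
for col in to_transform:
    transformed[col] = np.log1p(transformed[col])
```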
