```python
multi_level = []
for column in df_copy.columns:
    if feat_info.loc[column].type == 'categorical' and len(df_copy[column].unique()) > 2:
        multi_level.append(column)

for col in multi_level:
    df_copy.drop(col, axis=1, inplace=True)

# Re-engineer the mixed-type features into separate, interpretable columns.
df_copy['decade'] = df_copy['PRAEGENDE_JUGENDJAHRE'].apply(create_interval_decade)
df_copy['movement'] = df_copy['PRAEGENDE_JUGENDJAHRE'].apply(create_binary_movement)
df_copy.drop('PRAEGENDE_JUGENDJAHRE', axis=1, inplace=True)

df_copy['wealth'] = df_copy['CAMEO_INTL_2015'].apply(wealth)
df_copy['life_stage'] = df_copy['CAMEO_INTL_2015'].apply(life_stage)
df_copy.drop('CAMEO_INTL_2015', axis=1, inplace=True)

# One-hot encode the remaining binary categorical feature.
df_copy = pd.get_dummies(data=df_copy, columns=['OST_WEST_KZ'])

# Drop the remaining mixed-type features.
mixed = ['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'WOHNLAGE', 'PLZ8_BAUMAX']
for c in mixed:
    df_copy.drop(c, axis=1, inplace=True)

# Return the cleaned dataframe.
return df_copy
```

I made a data cleaning function that I could apply to both the general population demographics data and the customer demographics data.

Principal Component Analysis

PCA is one of the most widely used unsupervised machine learning tools. Principal components are linear combinations of the original features in a data set, constructed to retain as much of the information in the original data as possible. Principal Component Analysis is a common method for extracting new "latent features" from a data set based on existing features; think of a principal component the same way you think about a latent feature. When we have data sets with hundreds or thousands of features, we must reduce the number of dimensions in order to effectively build a model.
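As an aside, the helper functions called in the cleaning code above (`create_interval_decade`, `create_binary_movement`, `wealth`, `life_stage`) are not shown in the write-up. Below is a plausible sketch of them, assuming the conventions of the project's data dictionary: `PRAEGENDE_JUGENDJAHRE` codes 1–15 combine a decade with a mainstream/avantgarde movement, and `CAMEO_INTL_2015` is a two-digit code whose tens digit encodes wealth and whose units digit encodes family life stage. The exact mappings here are assumptions, not the original code.

```python
import numpy as np
import pandas as pd

# Assumed decade mapping for PRAEGENDE_JUGENDJAHRE codes 1-15.
DECADE_MAP = {1: 40, 2: 40, 3: 50, 4: 50, 5: 60, 6: 60, 7: 60,
              8: 70, 9: 70, 10: 80, 11: 80, 12: 80, 13: 80, 14: 90, 15: 90}
# Assumed avantgarde codes; all other valid codes are mainstream.
AVANTGARDE = {2, 4, 6, 7, 9, 11, 13, 15}

def create_interval_decade(value):
    """Map a PRAEGENDE_JUGENDJAHRE code to its decade (interval feature)."""
    return DECADE_MAP.get(value, np.nan)

def create_binary_movement(value):
    """1 for avantgarde codes, 0 for mainstream, NaN for anything else."""
    if value in AVANTGARDE:
        return 1
    if value in DECADE_MAP:
        return 0
    return np.nan

def wealth(value):
    """Tens digit of the two-digit CAMEO_INTL_2015 code (assumed 1=wealthy ... 5=poorer)."""
    return np.nan if pd.isnull(value) else int(str(int(value))[0])

def life_stage(value):
    """Units digit of the two-digit CAMEO_INTL_2015 code (family life stage)."""
    return np.nan if pd.isnull(value) else int(str(int(value))[1])
```

Splitting each mixed-type column into two single-purpose numeric columns like this is what lets the later scaling and PCA steps treat them as ordinary interval features.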
There are two approaches to reducing dimensionality:

a) Feature selection: finding a subset of the original features that you determine to be the most relevant and useful.
b) Feature extraction: extracting, or constructing, new features called latent features.

Feature Transformation

I performed feature scaling so that the principal component vectors are not influenced by natural differences in scale between features.

```python
# Fill the NaN values with the mode of the respective column.
for col in missing_data_rows_low.columns:
    missing_data_rows_low[col] = missing_data_rows_low[col].fillna(
        missing_data_rows_low[col].mode()[0])

# Apply feature scaling to the general population demographics data.
normalizer = StandardScaler()
missing_data_rows_low[missing_data_rows_low.columns] = normalizer.fit_transform(
    missing_data_rows_low[missing_data_rows_low.columns])
missing_data_rows_low.head()
```

Dimensionality Reduction

```python
# Apply PCA to the data.
pca = PCA()
missing_data_rows_low_pca = pca.fit_transform(missing_data_rows_low)

# Investigate the variance accounted for by each principal component.
def scree_plot(pca):
    '''
    Creates a scree plot associated with the principal components.

    INPUT:  pca - a fitted scikit-learn PCA object
    OUTPUT: None
    '''
    num_components = len(pca.explained_variance_ratio_)
    ind = np.arange(num_components)
    vals = pca.explained_variance_ratio_

    plt.figure(figsize=(10, 6))
    ax = plt.subplot(111)
    cumvals = np.cumsum(vals)
    ax.bar(ind, vals)
    ax.plot(ind, cumvals)
    ax.xaxis.set_tick_params(width=0)
    ax.yaxis.set_tick_params(width=2, length=12)
    ax.set_xlabel("Principal Component")
    ax.set_ylabel("Variance Explained")
    plt.title('Explained Variance Per Principal Component')

scree_plot(pca)

# Re-apply PCA to the data while selecting the number of components to retain.
pca = PCA(n_components=41)
missing_data_rows_low_pca = pca.fit_transform(missing_data_rows_low)
```

Based on the scree plot, the variance explained becomes extremely low after 41
components and barely changes afterwards, so I ran PCA again with 41 components.

Each principal component is a unit vector that points in the direction of highest variance (after accounting for the variance captured by earlier principal components). The further a weight is from zero, the more the principal component points in the direction of the corresponding feature. If two features have large weights of the same sign (both positive or both negative), an increase in one can be expected to be associated with an increase in the other. In contrast, features with weights of opposite signs can be expected to show a negative correlation: an increase in one should correspond to a decrease in the other.

Apply Clustering to the General Population

```python
def get_kmeans_score(data, center):
    '''
    Returns the k-means score (SSE of points to their assigned centers).

    INPUT:  data   - the dataset you want to fit k-means to
            center - the number of centers you want (the k value)
    OUTPUT: score  - the SSE score for the k-means model fit to the data
    '''
    # Instantiate k-means.
    kmeans = KMeans(n_clusters=center)

    # Fit the model to the data.
    model = kmeans.fit(data)

    # Obtain a score related to the model fit.
    score = np.abs(model.score(data))

    return score

# Over a number of different cluster counts, run k-means clustering on the
# data and compute the within-cluster SSE.
scores = []
centers = list(range(1, 30, 3))

for center in centers:
    scores.append(get_kmeans_score(missing_data_rows_low_pca, center))

# Investigate the change in within-cluster distance across number of clusters.
plt.plot(centers, scores, linestyle='--', marker='o', color='b')
plt.xlabel('K')
plt.ylabel('SSE')
plt.title('SSE vs. K')

# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.
kmeans = KMeans(n_clusters=22)
model_general = kmeans.fit(missing_data_rows_low_pca)
predict_general = model_general.predict(missing_data_rows_low_pca)
```

Based on the plot, 22 seems to be a sufficient number of clusters; after that point, the rate of change in SSE is extremely low.

Compare Customer Data to Demographics Data

After clustering the general population demographics data, we apply the same data cleaning steps and clustering to the customer demographics data. The purpose is to see where the company's strongest customer base lies. If there is a higher proportion of persons in a cluster for the customer data than for the general population (e.g. 5% of persons are assigned to a cluster for the general population, but 15% of the customer data is closest to that cluster's centroid), that suggests the people in that cluster are a target audience for the company. On the other hand, if the proportion of data in a cluster is larger in the general population than in the customer data (e.g.
only 2% of customers are closest to a population centroid that captures 6% of the data), that suggests the group of persons is outside the target demographics.

Analyze a cluster where the customer data is overrepresented

```python
# What kinds of people are part of a cluster that is overrepresented in the
# customer data compared to the general population?
over = normalizer.inverse_transform(
    pca.inverse_transform(customers_clean_pca[np.where(predict_customers == 11)])).round()
df_over = pd.DataFrame(data=over, columns=customers_clean.columns)
df_over.head(10)
```

This segment is composed of individuals between 46 and 60+ years of age who are not financial minimalists, i.e. they are probably people in or close to retirement who are eager to consume goods and services.

Analyze a cluster where the customer data is underrepresented

```python
# What kinds of people are part of a cluster that is underrepresented in the
# customer data compared to the general population?
under = normalizer.inverse_transform(
    pca.inverse_transform(customers_clean_pca[np.where(predict_customers == 16)])).round()
df_under = pd.DataFrame(data=under, columns=customers_clean.columns)
df_under.head(10)
```

This segment is composed of people in the younger age group (up to 45 years old), with a larger proportion of unemployed people.

Project Conclusion

Recall from the data processing step that the group of rows with a high number of missing values was set aside to be treated as an additional cluster at the end; cluster 22 in the groups above is this added cluster. We can see that there is quite a difference between the population percentages and the customer percentages in most clusters.
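The over/under-representation comparison described above can be sketched as follows. This is an illustration of the comparison logic, not the original notebook code; it assumes `predict_general` and `predict_customers` hold the integer cluster labels produced by the fitted k-means model, and that the helper names are free to choose.

```python
import numpy as np
import pandas as pd

def cluster_proportions(labels, n_clusters):
    """Share of points assigned to each cluster, as a fraction of the total."""
    counts = np.bincount(labels, minlength=n_clusters)
    return counts / counts.sum()

def compare_clusters(predict_general, predict_customers, n_clusters=22):
    """Side-by-side cluster shares and their difference (customers - general).

    A positive difference flags a cluster that is overrepresented in the
    customer data, i.e. a likely target audience; a negative difference
    flags a group outside the target demographics."""
    general = cluster_proportions(predict_general, n_clusters)
    customers = cluster_proportions(predict_customers, n_clusters)
    return pd.DataFrame({
        'general_%': general * 100,
        'customer_%': customers * 100,
        'difference': (customers - general) * 100,
    }).sort_values('difference', ascending=False)
```

Sorting by the difference column puts the strongest customer segments (like cluster 11 above) at the top and the underrepresented ones (like cluster 16) at the bottom, which is exactly the comparison the conclusion draws on.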