This gives us a basis for dropping all rows with more than 70% NaN values, leaving a data frame better suited to accurate findings.

The bar plots below compare the feature distributions of four randomly selected columns in the population dataset (on the left) with the distributions of the same columns in the customers' dataset (on the right).

Distribution of features in the population dataset (left chart) against the customers' dataset (right chart) for the first two random columns

Distribution of features in the population dataset (left chart) against the customers' dataset (right chart) for the last two random columns

Inspect and re-encode required features

Upon analysis of our feature types, we notice that a total of 28 features (7 of mixed type and 21 categorical) require some engineering.

Binary categorical features are left unchanged, while multi-level categorical ones are one-hot encoded into binary columns. To enhance computational efficiency and avoid too high a ratio of features to samples, the ‘CAMEO_DEUG’ feature, with up to 44 categories, is dropped.

Out of the mixed-type features, ‘PRAEGENDE_JUGENDJAHRE’ (described as the dominating movement in the person’s youth) is engineered into two important features, ‘decades’ and ‘movement’ (mainstream vs.
avantgarde). ‘CAMEO_INTL_2015’ (international typology) is likewise engineered into two features, ‘wealth’ and ‘life stage’. As for ‘WOHNLAGE’ (residential area), the two flags (7 and 8) are replaced with NaNs, which are later imputed with the feature’s mean, given that we don’t know whether they describe a high-quality locality or not. Three other mixed-type features (‘LP_FAMILIE_GROB’, ‘LP_LEBENSPHASE_GROB’ and ‘LP_STATUS_GROB’) are dropped, while their corresponding detailed versions are retained.

‘LNR’ is a unique identifier without null values, but it carries no information beyond identifying each sample, so it is dropped.

Out of the remaining numerical features, ‘ANZ_HH_TITEL’, ‘ANZ_TITEL’ and ‘ANZ_KINDER’ are dropped because they are dominated by zero values (33486, 35674 and 33821 zero entries respectively), which leaves them with almost no variation to learn from. Though numeric, ‘AKT_DAT_KL’, ‘PLZ8_BAUMAX’ and ‘ARBEIT’ are one-hot encoded, given that their numeric values represent defined categories.

The following table summarizes all our engineered features with their transformed status and justifications.

Table of selected features to process and justifications

Feature Transformation

Before dimensionality reduction, we impute and normalize the data so that the principal component vectors are not influenced by differences in feature scales. To meet this objective, Sklearn’s Imputer (replacing null entries with the feature’s mean) and its StandardScaler are used.

Engineered Dataframe before normalization

Engineered Dataframe after Normalization

Feature Reduction

After data engineering, we notice a large increase (from 79 to 132) in the number of features. Even though there is quite a large number of samples, we still need to perform some feature reduction, not just to enhance computational efficiency but also to explain feature importance.

Principal component analysis

We use Sklearn’s PCA for feature reduction and a scree plot to find the best number of components
representing the entire dataset.

PCA Scree plot

From the scree plot, we can see that the first 260 components (out of 430) carry more than 90% of the information (explained variance) in the whole dataset. The remaining components add very little information beyond what the first 260 already explain.

The table below shows the first 10 features (out of 260) with the most information:

10 most explained features by PCA

It is worth noting from the table that tangible assets owned by individuals, such as cars, vans, trailers, and motorcycles, happen to be critical factors we should consider when predicting the class of a sample.

Summary of the Data engineering flow process

Unsupervised learning model

Clustering

KMeans clustering is used as the predictive model for the unlabeled dataset. To decide the appropriate number of clusters for this dataset, the elbow method is adopted, using Sklearn’s MiniBatchKMeans to improve computational efficiency. Candidate cluster counts from 5 to 30 are evaluated, looking for the point where the mean distance from samples to their cluster centers starts decreasing only by very small amounts.

Resulting Curve from Elbow method

The point on the graph that most resembles an elbow is at 15, which makes 15 the desired number of clusters for accurate segmentation.

Comparing Customer data to demographics data

After applying the data cleaning, imputation, and scaling transformations fitted on the demographic dataset to the customers’ dataset, we predict the cluster to which each customer belongs, using the cluster model fitted on the demographic dataset.

This process results in one set of demographic clusters in which customers are overrepresented and another set in which they are underrepresented.
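The over- and underrepresentation comparison can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used for the analysis: the helper names `cluster_shares` and `representation_ratio` are hypothetical, and the label lists are made-up stand-ins for the cluster assignments that the fitted clustering model would return for the population and for the customers.

```python
from collections import Counter


def cluster_shares(labels, n_clusters):
    """Return the fraction of samples assigned to each cluster index."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(k, 0) / total for k in range(n_clusters)]


def representation_ratio(population_labels, customer_labels, n_clusters):
    """Ratio of customer share to population share for each cluster.

    A ratio above 1 marks a cluster where customers are overrepresented,
    a ratio below 1 marks a cluster where they are underrepresented.
    Clusters with no population samples get NaN.
    """
    pop = cluster_shares(population_labels, n_clusters)
    cust = cluster_shares(customer_labels, n_clusters)
    return [c / p if p > 0 else float("nan") for c, p in zip(cust, pop)]


# Hypothetical cluster assignments, for illustration only.
population_labels = [0, 0, 1, 1, 2, 2, 2, 2]
customer_labels = [0, 0, 0, 1, 2, 2]

ratios = representation_ratio(population_labels, customer_labels, n_clusters=3)
over = [k for k, r in enumerate(ratios) if r > 1]
under = [k for k, r in enumerate(ratios) if r < 1]
print("overrepresented clusters:", over)
print("underrepresented clusters:", under)
```

Comparing shares rather than raw counts matters here because the two datasets differ in size. A cluster is only interesting when its share among customers departs from its share in the general population.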