Unsupervised Learning Project: Creating Customer SegmentsLearn how to develop and end-to-end Clustering and Dimensionality Reduction Project!Victor RomanBlockedUnblockFollowFollowingApr 28IntroductionThroughout this project, we are going to analyze the spending behaviours of several customers in some product categories.

The main goals of the project are:Grouping customers in clusters of similar spending characteristics.

Describing the variations within the different clusters, in order to find the best delivery structure for each group.

To carry out this project we will use the dataset that can be found in the following UCI Machine Learning Repository.

You can find the complete project, documentation and dataset on my GitHub page:https://github.

com/rromanss23/Machine_Leaning_Engineer_Udacity_NanoDegree/tree/master/projects/customer_segmentsWe will focus our analysis on the six product categories recorded for the customers, excluding the ‘Channel’ and ‘Region’ fields.

# Import libraries necessary for this projectimport numpy as npimport pandas as pdfrom IPython.

display import display # Allows the use of display() for DataFrames# Import supplementary visualizations code visuals.

pyimport visuals as vs# Pretty display for notebooks%matplotlib inline# Load the wholesale customers datasettry: data = pd.

read_csv("customers.

csv") data.

drop(['Region', 'Channel'], axis = 1, inplace = True) print("Wholesale customers dataset has {} samples with {} features each.

".

format(*data.

shape))except: print("Dataset could not be loaded.

Is the dataset missing?")Data ExplorationNow we will explore the dataset through visualizations and code in order to understand the relationships between features.

In addition, we will calculate a statistical description of the dataset and consider the overall relevance of each feature.

# Display a description of the datasetdisplay(data.

describe())# Display the head of the datasetdata.

head()Selecting SamplesIn order to understand better our dataset and how the data will be transformed through the analysis, we will select a few sample points and explore them in detail.

# Select three indices to sample from the datasetindices = [85,181,338]# Create a DataFrame of the chosen samplessamples = pd.

DataFrame(data.

loc[indices], columns = data.

keys()).

reset_index(drop = True)print("Chosen samples of wholesale customers dataset:")display(samples)ConsiderationsLet us consider now the total purchase cost of each product category and the statistical description of the dataset above for our sample customers.

If we would have to predict what kind of establishment (customer) each of the three samples chosen represent:Considering the mean values:Fresh: 12000.

2977Milk: 5796.

2Grocery: 3071.

9Detergents_paper: 2881.

4Delicatessen: 1524.

8We can make the following predictions:1) Index 85: Retailer:- Largest spending on detergents and paper and groceries of the entire dataset, which usually are products for houses.

– Higher than average spending on milk.

– Lower than average spending on frozen products.

2) Index 181: Large market- High spending on almost every product category.

– Highest spending on fresh products of the entire dataset.

Likely to be a large market.

– Low spending on detergents.

3) Index 338: Restaurant- The amount of every product is significantly lower than the previous two customers considered.

– The spending on fresh products is the lowest of the entire dataset.

– The spending on milk and detergent and papers is in the bottom quartile.

It may be a small and cheap restaurant which needs groceries and frozen food to serve the meals.

Feature RelevanceWe will now analyze the relevance of the features for understanding the purchasing behaviours of the customers.

In other words, to determine if a customer that purchase some amount of one category of products will necessarily purchase some proportional amount of another category of products.

We will study this by training a supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.

# Display the head of the datasetdata.

head(1)# Make a copy of the DataFrame, using the 'drop' function to drop the given featurenew_data = data.

drop('Grocery', axis=1)# Split the data into training and testing sets(0.

25) using the given feature as the target# Set a random state.

from sklearn.

model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(new_data, data.

Grocery, test_size=0.

25, random_state=42)# Create a decision tree regressor and fit it to the training setfrom sklearn.

tree import DecisionTreeRegressorregressor = DecisionTreeRegressor()regressor = regressor.

fit(X_train, y_train)prediction = regressor.

predict(X_test)# Report the score of the prediction using the testing setfrom sklearn.

metrics import r2_scorescore = r2_score(y_test, prediction)print("Prediction score is: {}".

format(score))We tried to predict the Grocery feature.

The reported prediction score was 67.

25%.

As we obtained a high score, it as indicator of a very good fit.

So this feature is easy to predict considering the rest of spendign habits and, therefore, not very necessary for identifying customers’ spending habits.

Visualizing Feature DistributionsIn order to understand better our dataset, we will display a scatter matrix of every product feature.

The product features that show a correlation in the scatter matrix will be relevant to predict others.

# Produce a scatter matrix for each pair of features in the datapd.

scatter_matrix(data, alpha = 0.

3, figsize = (14,8), diagonal = 'kde');@# Display a correlation matriximport seaborn as snssns.

heatmap(data.

corr())Using the scatter matrix and the correlation matrix as a references, we can infer the following:Data is not normally distributed, it is positively skewed and it resemeble the log-normal distribution.

In most plots, most data points lie near the origin which shows little correlation between them.

From the scatter plots and the heatmap of correlation, we can see that there is a strong correlation between the ‘Grocery’ and ‘Detergent_paper’ features.

The features ‘Grocery’ and ‘Milk’ also show a good degree of correlation.

This correlation confirms my guess about the relevance of the ‘Grocery’ feature, which can be accurately predicted with the ‘Detergent_paper’ feature.

And, therefore, is not an absolutely necessary feature in the dataset.

Data PreprocessingThis step is crucial to make sure that the results obtained are significant, meaningful and they are optimized.

We will preprocess data by scaling it and detecting potential outliers.

Feature ScalingUsually, when data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling — particularly for financial data.

One way to achieve this scaling is by using a Box-Cox test, which calculates the best power transformation of the data that reduces skewness.

A simpler approach which can work in most cases would be applying the natural logarithm.

# Scale the data using the natural logarithm log_data = np.

log(data) # Scale the sample data using the natural logarithm log_samples = np.

log(samples) # Produce a scatter matrix for each pair of newly-transformed features pd.

scatter_matrix(log_data, alpha = 0.

3, figsize = (14,8), diagonal = 'kde');ObservationAfter applying a natural logarithm scaling to the data, the distribution of each feature appear much more normal.

For any pairs of features we have identified earlier as being correlated, we observe here that correlation is still present (and whether it is now stronger or weaker than before).

Displaying the actual data:# Display the log-transformed sample datadisplay(log_samples)Outlier DetectionDetecting outliers in the data is extremely important in the data preprocessing step of any analysis.

The presence of outliers can often skew results which take into consideration these data points.

Here, we will use Tukey’s Method for identfying outliers: An outlier step is calculated as 1.

5 times the interquartile range (IQR).

A data point with a feature that is beyond an outlier step outside of the IQR for that feature is considered abnormal.

outliers = []# For each feature find the data points with extreme high or low valuesfor feature in log_data.

keys(): # Calculate Q1 (25th percentile of the data) for the given feature Q1 = np.

percentile(log_data[feature],25) # Calculate Q3 (75th percentile of the data) for the given feature Q3 = np.

percentile(log_data[feature],75) # Use the interquartile range to calculate an outlier step (1.

5 times the interquartile range) step = 1.

5 * (Q3-Q1) # Display the outliers print("Data points considered outliers for the feature '{}':".

format(feature)) display(log_data[~((log_data[feature] >= Q1 – step) & (log_data[feature] <= Q3 + step))]) lista = log_data[~((log_data[feature] >= Q1 – step) & (log_data[feature] <= Q3 + step))].

index.

tolist() outliers.

append(lista)outliers # Detecting outliers that appear in more than one productseen = {}dupes = []for lista in outliers: for index in lista: if index not in seen: seen[index] = 1 else: if seen[index] == 1: dupes.

append(index) seen[index] += 1dupes = sorted(dupes)dupes# Removing outliers good_data = log_data.

drop(dupes, axis=0).

reset_index(drop=True)ObservationsDatapoints considered outliers that are present in more than one feature are: 65, 66, 75, 128, 154.

K-Means is heavily influenced by the presence of outliers as they increase significantly the loss function that the algorithm tries to minimize.

This loss function is the squared sum of the distances of each datapoint to the centroid, so, if the outlier is far enough, the centroid will be incorrectly situated.

Because of this, the outliers shoul be removed.

Feature TransformationNow we will use Principal Component Analysis (PCA) to extract conclusions about the hidden structure of the dataset.

PCA is used to calculate those dimensions that maximize variance, so we will find the combination of features that describe best each customer.

Principal Component Analysis (PCA)Once the data has been scaled to a normal distribution and the necessary outliers have been removed, we can apply PCA to the good_data to discover which dimensions about the data best maximize the variance of features involved.

In addition to finding these dimensions, PCA will also report the explained variance ratio of each dimension — how much variance within the data is explained by that dimension alone.

# Get the shape of the log_sampleslog_samples.

shape# Apply PCA by fitting the good data with the same number of dimensions as featuresfrom sklearn.

decomposition import PCApca = PCA(n_components=good_data.

shape[1])pca = pca.

fit(good_data)# Transform log_samples using the PCA fit abovepca_samples = pca.

transform(log_samples)# Generate PCA results plotpca_results = vs.

pca_results(good_data, pca)ObservationsThe variance explained by the first two Principal Components is the 70.

68% of the total.

The variance explained by the first three Principal Components is the 93.

11% of the total.

Dimensions discussionDimension 1: This dimension represents well, in terms of negative variance, the following features: Detergent_Paper, Milk and groceries.

Mostly utilities for everyday consuming.

Dimension 2: This dimension represents well, in terms of negative variance, the following features: Fresh, Frozen and Delicatessen.

Mostly food consuming.

Dimension 3: This dimension represents well, in terms of positive variance, the Delicatessen features, and in terms of negative variance de Fresh feature.

Food to be consumed on the day.

Dimension 4: This dimension represents well, in terms of positive variance, the Frozen feature, and in terms of negative variance, the Delicatessen Feature.

Food that can be storaged.

# Display sample log-data after having a PCA transformation applieddisplay(pd.

DataFrame(np.

round(pca_samples, 4), columns = pca_results.

index.

values))Dimensionality ReductionWhen using principal component analysis, one of the main goals is to reduce the dimensionality of the data.

Dimensionality reduction comes at a cost: Fewer dimensions used implies less of the total variance in the data is being explained.

Because of this, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem.

Additionally, if a signifiant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.

# Apply PCA by fitting the good data with only two dimensionspca = PCA(n_components=2).

fit(good_data)# Transform the good data using the PCA fit abovereduced_data = pca.

transform(good_data)# Transform log_samples using the PCA fit abovepca_samples = pca.

transform(log_samples)# Create a DataFrame for the reduced datareduced_data = pd.

DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])The cell below show how the log-transformed sample data has changed after having a PCA transformation applied to it using only two dimensions.

Observe how the values for the first two dimensions remains unchanged when compared to a PCA transformation in six dimensions.

# Display sample log-data after applying PCA transformation in two dimensionsdisplay(pd.

DataFrame(np.

round(pca_samples, 4), columns = ['Dimension 1', 'Dimension 2']))Visualizing a BiplotA biplot is a scatterplot where each data point is represented by its scores along the principal components.

The axes are the principal components (in this case Dimension 1 and Dimension 2).

The biplot shows the projection of the original features along the components.

A biplot can help us interpret the reduced dimensions of the data, and discover relationships between the principal components and original features.

# Create a biplotvs.

biplot(good_data, reduced_data, pca)Once we have the original feature projections (in red), it is easier to interpret the relative position of each data point in the scatterplot.

For instance, a point the lower right corner of the figure will likely correspond to a customer that spends a lot on 'Milk', 'Grocery'and 'Detergents_Paper', but not so much on the other product categories.

ClusteringIn this section, we will choose to use either a K-Means clustering algorithm or a Gaussian Mixture Model (GMM) clustering algorithm to identify the various customer segments hidden in the data.

We will then recover specific data points from the clusters to understand their significance by transforming them back into their original dimension and scale.

K-Means vs GMM1) The main advantages of using K-Means as a cluster algorithm are:- It is easy to implement.

– With large number of variables, if (K is small), it may be computationally faster than hierarchichal clustering.

– Consistent and scale-invariant.

– It is guaranteed to converge.

2) The main advantages of using Gaussian Mixture Models as a cluster algorithm are:- It is much more flexible in terms of cluster covariance.

Which means that each cluster can have unconstrained covariance structure.

In other words, whereas K-means assumes that every cluster have spherical estructure, GMM allows eliptical.

– Points can belong to different clusters, with different level of memebership.

This level of membership is the probability of each point to belong to each cluster.

3) Chosen algorithm:The chosen algorithm is Gaussian Mixture Model.

Because data is not splitted in clear and different clusters, so we do not know how many clusters there are.

Creating ClustersWhen the number of clusters is not known a priori, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data.

However, we can quantify the “goodness” of a clustering by calculating each data point’s silhouette coefficient.

The silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar).

Calculating the mean silhouette coefficient provides for a simple scoring method of a given clustering.

# Import the necessary librariesfrom sklearn.

mixture import GaussianMixturefrom sklearn.

metrics import silhouette_scorescores = {}for i in range(2,7): print('Number of clusters: ' + str(i)) # Apply your clustering algorithm of choice to the reduced data clusterer = GaussianMixture(random_state=42, n_components=i) clusterer.

fit(reduced_data) # Predict the cluster for each data point preds = clusterer.

predict(reduced_data) # Find the cluster centers centers = clusterer.

means_ print('Cluster Center: ' + str(centers)) # Predict the cluster for each transformed sample data point sample_preds = clusterer.

predict(pca_samples) print('Sample predictions: ' + str(sample_preds)) # Calculate the mean silhouette coefficient for the number of clusters chosen score = silhouette_score(reduced_data, preds) scores[i] = score print('Silhouette score is: ' + str(score), '.') print('Scores: ' + str(scores))The number of cluster with the best Silhouette Score is 2, with a score of 0.

42.

Cluster VisualizationOnce we’ve chosen the optimal number of clusters for the clustering algorithm using the scoring metric above, we can now visualize the results in the code block below.

# Apply your clustering algorithm of choice to the reduced data clusterer = GaussianMixture(random_state=42, n_components=2)clusterer.

fit(reduced_data)# Predict the cluster for each data pointpreds = clusterer.

predict(reduced_data)# Find the cluster centerscenters = clusterer.

means_print('Cluster Center: ' + str(centers))# Predict the cluster for each transformed sample data pointsample_preds = clusterer.

predict(pca_samples)print('Sample predictions: ' + str(sample_preds))# Calculate the mean silhouette coefficient for the number of clusters chosenscore = silhouette_score(reduced_data, preds)scores[i] = scoreprint('Silhouette score is: ' + str(score), '.')# Display the results of the clustering from implementationvs.

cluster_results(reduced_data, preds, centers, pca_samples)Data RecoveryEach cluster present in the visualization above has a central point.

These centers (or means) are not specifically data points from the data, but rather the averages of all the data points predicted in the respective clusters.

For the problem of creating customer segments, a cluster’s center point corresponds to the average customer of that segment.

Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.

# Inverse transform the centerslog_centers = pca.

inverse_transform(centers)# Exponentiate the centerstrue_centers = np.

exp(log_centers)# Display the true centerssegments = ['Segment {}'.

format(i) for i in range(0,len(centers))]true_centers = pd.

DataFrame(np.

round(true_centers), columns = data.

keys())true_centers.

index = segmentsdisplay(true_centers)Segment 0 may represent a a fresh food market as every feature except Frozen and Fresh are below the median.

Segment 1 may represent a supermarket as every feature except fresh and frozen are above the median.

The code below shows to which ckustr each sample point is predicted ot belong.

# Display the predictionsfor i, pred in enumerate(sample_preds): print("Sample point", i, "predicted to be in Cluster", pred)ObservationsSample point 0 → Supermarket and the original guess was a retailer.

This difference may be explained because of the size of the cluster (which is pretty big)Sample point 1 → Supermarket and the originak guess was the same.

Sample point 2 → Fresh food market and the original guess was a restaurant which is reasonable considering the amount of the spending of the features.

ConclusionHow can the wholesale distributor label the new customers using only their estimated product spending and the customer segment data?A supervised learning algorithm could be used with the estimated product spending as attributes and the customer segment as the target variable, making it a classification problem (we would have 2 possible labels).

As there is not a clear mathematical relationship between the customer segment and the product spending KNN could be a good algorithm to work with.

Visualizing Underlying DistributionsAt the beginning of this project, it was discussed that the 'Channel' and 'Region' features would be excluded from the dataset so that the customer product categories were emphasized in the analysis.

By reintroducing the 'Channel' feature to the dataset, an interesting structure emerges when considering the same PCA dimensionality reduction applied earlier to the original dataset.

The code block below shows how each data point is labeled either 'HoReCa' (Hotel/Restaurant/Cafe) or 'Retail' the reduced space.

# Display the clustering results based on 'Channel' data vs.

channel_results(reduced_data, preds, pca_samples)We can observe that the cluster algorithm does a pretty good job of clustering the data to the underlying distribution as the cluster 0 can be associated perfectly with a retailer and the cluster 1 to the Ho/Re/Ca.

.