Hands-On with Unsupervised Learning
A quick tutorial on k-means clustering and principal component analysis (PCA).
Marco Peixeiro
Jan 30
Photo by Ryoji Iwata on Unsplash

In a previous post, unsupervised learning was introduced as a set of statistical tools for scenarios in which there is a set of features, but no targets.
Therefore, this tutorial will be different from the others, since there are no targets to predict.
Instead, we will work with k-means clustering to perform color quantization on an image.
Then, we will use PCA for dimensionality reduction and visualization of a dataset.
The full notebook is available here.
Spin up your Jupyter notebook, and let’s go!

Unlike a chainsaw, you can use this tutorial unsupervised

Setup
Before starting on any implementation, we will import a few libraries that will come in handy later on.
Unlike previous tutorials, we will not import datasets.
Instead, we will use data provided by the scikit-learn library.

Color quantization with k-means clustering
In short, color quantization is a technique that reduces the number of distinct colors used in an image.
This is especially useful for compressing an image while preserving its overall appearance.
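To make the compression claim concrete, here is a rough back-of-the-envelope sketch. It assumes an uncompressed 8-bit RGB image and a simple palette-indexed layout; the function name `indexed_size_bytes` and the image dimensions are hypothetical, chosen only for illustration. Real file formats add headers and their own compression, so treat the numbers as an estimate.

```python
import math


def indexed_size_bytes(height, width, n_colors, channels=3):
    """Approximate size of a palette-indexed image, ignoring file-format overhead."""
    # Each pixel stores an index into the palette instead of a full RGB triple.
    bits_per_pixel = max(1, math.ceil(math.log2(n_colors)))
    index_bytes = math.ceil(height * width * bits_per_pixel / 8)
    # The palette itself costs one byte per channel per color.
    palette_bytes = n_colors * channels
    return index_bytes + palette_bytes


height, width = 400, 600  # hypothetical image dimensions
raw_bytes = height * width * 3  # 3 bytes (R, G, B) per pixel
quantized_bytes = indexed_size_bytes(height, width, n_colors=64)

print(raw_bytes)        # 720000
print(quantized_bytes)  # 180192
```

With 64 colors, each pixel needs only a 6-bit index rather than 24 bits, so the pixel data shrinks to roughly a quarter of its raw size in this estimate.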
To get started, we import the libraries needed for this exercise.
Notice that we import a helper called load_sample_image from scikit-learn.
It gives access to the two sample images bundled with the library.
We will use one of them to perform color quantization.
So, let’s show the image we will use for this exercise.

[Figure: Original image]

Now, color quantization involves several steps.
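The steps this section walks through (flatten the image, fit k-means, rebuild the image from the cluster centers) can be sketched roughly as follows. This is a minimal sketch, not necessarily the notebook's code: the names `quantize` and `count_colors` are hypothetical, fitting on a random subsample of pixels is an assumption made to keep training fast, and a small random image stands in for the real one so the example runs quickly.

```python
import numpy as np
from sklearn.cluster import KMeans


def count_colors(image):
    """Count the distinct colors in a (height, width, channels) array."""
    arr = np.asarray(image)
    return len(np.unique(arr.reshape(-1, arr.shape[-1]), axis=0))


def quantize(image, n_colors=64, sample_size=1000, seed=0):
    """Return a float image in [0, 1] that uses at most n_colors distinct colors."""
    # Scale 8-bit pixel values to floats in [0, 1].
    pixels = np.asarray(image, dtype=np.float64) / 255.0
    height, width, channels = pixels.shape
    # Step 1: flatten into a 2D matrix with one row per pixel.
    flat = pixels.reshape(height * width, channels)
    # Step 2: fit k-means on a random subsample of pixels (assumption: for speed).
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(flat), size=min(sample_size, len(flat)), replace=False)
    kmeans = KMeans(n_clusters=n_colors, n_init=10, random_state=seed).fit(flat[idx])
    # Step 3: assign every pixel to its nearest center and repaint it with that color.
    labels = kmeans.predict(flat)
    return kmeans.cluster_centers_[labels].reshape(height, width, channels)


# Stand-in image (assumption): random 8-bit RGB noise instead of the sample photo.
rng = np.random.default_rng(42)
image = rng.integers(0, 256, size=(40, 60, 3), dtype=np.uint8)

quantized = quantize(image, n_colors=8)
print(count_colors(image), "->", count_colors(quantized))
```

To try it on one of the bundled images, `load_sample_image` from `sklearn.datasets` accepts `"china.jpg"` or `"flower.jpg"` and returns an array that can be passed to `quantize` directly; the result can then be displayed with matplotlib's `imshow`.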
First, we need to reshape the image into a 2D matrix, with one row per pixel, for manipulation.
Then, we train our model to group the colors into 64 clusters, so that the image ends up with 64 distinct colors.
Then, we build a helper function to reconstruct the image with the specified number of colors.
Finally, we can visualize how the image looks with only 64 colors, and how it compares to the original one.

[Figure: Original image with 96,615 colors]
[Figure: Reconstructed image with 64 colors]

Of course, we can see some differences, but overall, the integrity of the image is preserved!
Do explore different numbers of clusters!
For example, here is what you get if you specify 10 colors:

[Figure: Reconstructed image with 10 colors]

Dimensionality reduction with PCA
For this exercise, we will use PCA to reduce the dimensions of a dataset so we can easily visualize it.
Therefore, let’s import the iris dataset from scikit-learn.
Now, we will compute the first two principal components and see what proportion of the variance is explained by each.
You should see that the first principal component explains about 92% of the variance, while the second accounts for about 5%.
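As a minimal sketch of this step (variable names such as `X_2d` and `ratios` are my own, not necessarily those used in the notebook), the explained variance can be obtained like this:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the iris dataset: 150 samples, 4 features, 3 species.
iris = load_iris()
X, y = iris.data, iris.target

# Keep only the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Proportion of the total variance explained by each component.
ratios = pca.explained_variance_ratio_
print(ratios)
print(ratios.sum())
```

The two columns of `X_2d` are the coordinates of each flower along the first and second principal components, so they can be passed straight to a scatter plot, colored by `y`.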
This means that just two principal components, each a linear combination of the four original features, are sufficient to explain about 97% of the variance in the dataset!
Now, we can use this to easily plot the data in two dimensions.

[Figure: the data plotted in two dimensions]

That’s it! You now know how to implement k-means and PCA!
Again, keep in mind that unsupervised learning is hard, because there is no error metric to evaluate how well the algorithm performed.
Also, these techniques are usually used for exploratory data analysis before moving on to supervised learning.
Leave me a comment to ask a question or to tell me how to improve!
Keep working hard!

These exercises were examples available on the scikit-learn website.
I simply tried to explain them in more detail.