# Mixture modelling from scratch, in R

The full code is however provided in the Appendix.

Comparing to the Iris species labelling, we get an accuracy of 88.

7% (no error for setosa, 3 observations misclassified for versicolor and 14 for virginica — note that results can slightly change depending on the random initiation).

Using the kmeans() built-in R function yields a similar result but faster.

K-means clustering of the Iris dataset | The different iterations to convergence, with radius of the cluster spherical decision boundary plotted.

Gaussian Mixture ModelThe mixture of Gaussians (Gaussian Mixture Model or GMM) is the most widely used mixture model.

GMM can be described as a soft version of K-means with Gaussian density.

Below is a R code for the multidimensional case meaning that we go full tensorial, which is the best way to learn all the intricacies of GMM via the multivariate normal distribution and eigendecomposition of the covariance matrix.

Here we go:# Uses EM algorithm with multivariate normal# distribution to estimate cluster probabilitymvnorm.

cov.

inv <- function(Sigma) { # Eigendecomposition of covariance matrix E <- eigen(Sigma) Lambda.

inv <- diag(E\$values^-1) # diagonal matrix Q <- E\$vectors return(Q %*% Lambda.

inv %*% t(Q))}#multivariate Gaussian pdfmvn.

pdf.

i <- function(xi, mu, Sigma) 1/sqrt( (2*pi)^length(xi) * det(Sigma) ) * exp(-(1/2) * t(xi – mu) %*% mvnorm.

cov.

inv(Sigma) %*% (xi – mu) )mvn.

pdf <- function(X, mu, Sigma) apply(X, 1, function(xi) mvn.

pdf.

i(as.

numeric(xi), mu, Sigma))gmm.

fromscratch <- function(X, k){ p <- ncol(X) # number of parameters n <- nrow(X) # number of observations Delta <- 1; iter <- 0; itermax <- 30 while(Delta > 1e-4 && iter <= itermax){ # initiation if(iter == 0){ km.

init <- km.

fromscratch(X, k) mu <- km.

init\$centroid; mu_mem <- mu w <- sapply(1:k, function(i) length(which(km.

init\$cluster == i))) w <- w/sum(w) cov <- array(dim = c(p, p, k)) for(i in 1:p) for(j in 1:p) for(c in 1:k) cov[i, j, c] <- 1/n * sum((X[km.

init\$cluster == c, i] – mu[c, i]) * (X[km.

init\$cluster == c, j] – mu[c, j])) } # E-step mvn.

c <- sapply(1:k, function(c) mvn.

pdf(X, mu[c,], cov[,, c])) r_ic <- t(w*t(mvn.

c)) / rowSums(t(w*t(mvn.

c))) # M-step n_c <- colSums(r_ic) w <- n_c/sum(n_c) mu <- t(sapply(1:k, function(c) 1/n_c[c] * colSums(r_ic[, c] * X))) for(i in 1:p) for(j in 1:p) for(c in 1:k) cov[i, j, c] <- 1/n_c[c] * sum(r_ic[, c] * (X[, i] – mu[c, i]) * r_ic[, c] * (X[, j] – mu[c, j])) Delta <- sum((mu – mu_mem)^2) iter <- iter + 1; mu_mem <- mu } return(list(softcluster = r_ic, cluster = apply(r_ic, 1, which.

max)))}# run GMMgmm <- gmm.

fromscratch(X, 3)pairs(X, lower.

panel = NULL, col = gmm\$cluster)table(y, gmm\$cluster)Let’s compare with gmm.

mclust <- Mclust(X, 3) from library(mclust):Once again, the plotting of the ellipses during iteration, as illustrated in the animation that follows, is not shown in the present code for clarity (see Appendix for the full code).

Comparing to the Iris species labelling, we get an accuracy of 96.

7% when classifying each observation to the highest cluster probability (no error for setosa and virginica, 5 observations misclassified for versicolor).

Note however that results may slightly vary depending on the random parameter initiation; run the algorithms several times to investigate this variability.

Similar results are obtained when using Mclust() (but Mclust() is obviously much faster).

If we compare in detail the results of both clustering algorithms, we can see that it is the spherical assumption made by K-means that makes it perform slightly worse than a GMM on the Iris dataset.

Gaussian Mixture Modelling of the Iris dataset | The different iterations to convergence, with elliptic distribution of each cluster plotted.

See now the resemblance between the two algorithms, both structured in three steps: initiation, E-step and M-step of the Expectation-Maximization (EM) algorithm [10].

EM is a simple and powerful iterative algorithm which alternates between inferring the cluster given the parameters (E-step), and then optimizing the parameters given the predicted cluster (M-step).

Voila.

Now that you understand the steps of those two classic clustering algorithms, I recommend that you use official R functions (kmeans(), Mclust(), etc.

), which have been benchmarked and are far more efficient.

[1] C.

M.

Bishop, Pattern Recognition and Machine Learning (2006), Springer[1] C.

M.

Bishop, Pattern Recognition and Machine Learning (2006), Springer[2] T.

Hastie et al.

, The Elements of Statistical Learning, Data Mining, Inference, and Prediction (2009), Springer, 2nd ed.

[3] K.

P.

Murphy, Machine Learning, A Probabilistic Perspective (2012), The MIT Press[4] E.

Anderson, The species problem in Iris (1936), Annals of the Missouri Botanical Garden[5] A.

Hald, A History of Mathematical Statistics (1998), Wiley[7] A.

K.

Jain, Data clustering: 50 years beyond K-means (2010), Pattern Recognition Letters[8] H.

Steinhaus, Sur la division des corps matériels en parties (1956), Bulletin de l’Académie Polonaise des Sciences[9] J.

MacQueen, Some methods for classification and analysis of multivariate observations (1967), Fifth Berkeley Symposium[10] A.

P.

Dempster et al.

, Maximum Likelihood from Incomplete Data via the EM Algorithm (1977), Journal of the Royal Statistical SocietyAppendix: Animated K-meanswd <- getwd()# finds partition such that squared error between empirical mean# and points in cluster is minimized over all k clusterskm.

fromscratch <- function(X, k, plot = F){ p <- ncol(X) # number of parameters n <- nrow(X) # number of observations Delta <- 1; iter <- 0; itermax <- 30 class_col <- c('#7DB0DD', '#86B875', '#E495A5') while(Delta > 1e-4 && iter <= itermax){ # initiation if(iter == 0){ centroid <- X[sample(nrow(X), k),] centroid_mem <- centroid } # equivalent to E-step d <- sapply(1:k, function(c) sapply(1:n, function(i) sum((centroid[c,] – X[i,])^2) )) cluster <- apply(d, 1, which.

min) # equivalent to M-step centroid <- t(sapply(1:k, function(c) apply(X[cluster == c,], 2, mean))) rad <- sapply(1:k, function(c) max(sqrt(d[cluster == c,c]))) if(plot){ i <- 1 idx <- matrix(rep(seq(p), p), ncol = p, nrow = p) idx <- idx[lower.

tri(idx)] idy <- matrix(rep(seq(p), each=p), ncol = p, nrow = p) idy <- idy[lower.

tri(idy)] theta <- seq(0,1,0.

01) * 2*pi png(paste0(wd, '/fig_kmeans/iter', iter, '.

png')) pairs(rbind(X, centroid), lower.

panel = NULL, asp = 1, col = c(class_col[cluster], rep('black', k)), main = paste0('iter=',iter), panel=function(x, y, .

off() } Delta <- sum((centroid – centroid_mem)^2) iter <- iter + 1; centroid_mem <- centroid } return(list(centroid = centroid, cluster = cluster))}# run K-meanskm <- km.

fromscratch(X, 3, plot = T)table(y, km\$cluster)library(magick)list.

files(path = paste0(wd, '/fig_kmeans/'), pattern = '*.

png', full.

names = T) %>% image_read() %>% image_join() %>% image_animate(fps=1) %>% image_write('fig_kmeans_anim.

gif')Appendix: Animated GMMlibrary(reshape) #cast()wd <- getwd()# Uses EM algorithm with multivariate normal# distribution to estimate cluster probabilitymvnorm.

cov.

inv <- function(Sigma) { # Eigendecomposition of covariance matrix E <- eigen(Sigma) Lambda.

inv <- diag(E\$values^-1) # diagonal matrix with inverse of eigenvalues Q <- E\$vectors # eigenvectors return(Q %*% Lambda.

inv %*% t(Q))}#multivariate Gaussian pdfmvn.

pdf.

i <- function(xi, mu, Sigma) 1/sqrt( (2*pi)^length(xi) * det(Sigma) ) * exp(-(1/2) * t(xi – mu) %*% mvnorm.

cov.

inv(Sigma) %*% (xi – mu) )mvn.

pdf <- function(X, mu, Sigma) apply(X, 1, function(xi) mvn.

pdf.

i(as.

numeric(xi), mu, Sigma))gmm.

fromscratch <- function(X, k, plot = F){ p <- ncol(X) # number of parameters n <- nrow(X) # number of observations Delta <- 1; iter <- 0; itermax <- 30 class_col <- c('#7DB0DD', '#86B875', '#E495A5') while(Delta > 1e-4 && iter <= itermax){ # initiation if(iter == 0){ km.

init <- km.

fromscratch(X, k) mu <- km.

init\$centroid; mu_mem <- mu w <- sapply(1:k, function(i) length(which(km.

init\$cluster == i))) w <- w/sum(w) cov <- array(dim = c(p, p, k)) for(i in 1:p) for(j in 1:p) for(c in 1:k) cov[i, j, c] <- 1/n * sum((X[km.

init\$cluster == c, i] – mu[c, i]) * (X[km.

init\$cluster == c, j] – mu[c, j])) } # E-step mvn.

c <- sapply(1:k, function(c) mvn.

pdf(X, mu[c,], cov[,, c])) r_ic <- t(w*t(mvn.

c)) / rowSums(t(w*t(mvn.

c)))# M-step n_c <- colSums(r_ic) w <- n_c/sum(n_c) mu <- t(sapply(1:k, function(c) 1/n_c[c] * colSums(r_ic[, c] * X))) for(i in 1:p) for(j in 1:p) for(c in 1:k) cov[i, j, c] <- 1/n_c[c] * sum(r_ic[, c] * (X[, i] – mu[c, i]) * r_ic[, c] * (X[, j] – mu[c, j]))cluster <- apply(r_ic, 1, which.

max) if(plot){ i <- 1 idx <- matrix(rep(seq(p), p), ncol = p, nrow = p) idx <- idx[lower.

tri(idx)] idy <- matrix(rep(seq(p), each=p), ncol = p, nrow = p) idy <- idy[lower.

tri(idy)] if(iter < 10) iter4plot <- paste0('0', iter) else iter4plot <- iter png(paste0(wd, '/figs_gmm/iter', iter4plot, '.

png')) pairs(rbind(X, mu), lower.

panel = NULL, asp = 1, col = c(class_col[cluster], rep('black', k)), main = paste0('iter=',iter), panel=function(x, y, .

) { points(x, y, col = c(class_col[cluster], rep('black', k))) xi <- seq(min(X[, idx[i]])-1, max(X[, idx[i]])+1, 0.

1) yi <- seq(min(X[, idy[i]])-1, max(X[, idy[i]])+1, 0.

1) grid <- expand.

grid(xi = xi, yi = yi) grid['z'] <- mvn.

pdf(grid, mu[1,c(idx[i],idy[i])], cov[c(idx[i],idy[i]),c(idx[i],idy[i]), 1]) z <- cast(grid, xi ~ yi) contour(xi, yi, as.

matrix(z[,-1]), levels = c(.

1, .

5, .

9), col = class_col[1], add = T, lty = 'solid', labels = '') grid <- expand.

grid(xi = xi, yi = yi) grid['z'] <- mvn.

pdf(grid, mu[2,c(idx[i],idy[i])], cov[c(idx[i],idy[i]),c(idx[i],idy[i]), 2]) z <- cast(grid, xi ~ yi) contour(xi, yi, as.

matrix(z[,-1]), levels = c(.

1, .

5, .

9), col = class_col[2], add = T, lty = 'solid', labels = '') grid <- expand.

grid(xi = xi, yi = yi) grid['z'] <- mvn.

pdf(grid, mu[3,c(idx[i],idy[i])], cov[c(idx[i],idy[i]),c(idx[i],idy[i]), 3]) z <- cast(grid, xi ~ yi) contour(xi, yi, as.

matrix(z[,-1]), levels = c(.

1, .

5, .

9), col = class_col[3], add = T, lty = 'solid', labels = '') i <<- i+1 }) dev.

off() } Delta <- sum((mu – mu_mem)^2) iter <- iter + 1; mu_mem <- mu } return(list(softcluster = r_ic, cluster = cluster))}gmm <- gmm.

fromscratch(X, 3, plot = T)table(y, gmm\$cluster)library(magick)list.

files(path = paste0(wd, "/figs_gmm/"), pattern = "*.

png", full.

names = T) %>% image_read() %>% image_join() %>% image_animate(fps=1) %>% image_write("fig_gmm_anim.

gif")Arnaud Mignan is a Catastrophe Risk Researcher who writes on Medium about the History, Risk and Forecast of Perils & about Machine Intelligence in general.