# Behind the Models: Beta, Dirichlet, and GEM Distributions

*Building Blocks for Non-Parametric Bayesian Models* · Tony Pistilli · May 31

In a future post I want to cover non-parametric Bayesian models: these models are infinite-dimensional and allow for expansive online learning.

But first I want to cover some of the building blocks: Beta, Dirichlet, and GEM distributions.

These distributions have several helpful properties that provide for a wide variety of machine learning uses in addition to non-parametric Bayes.

## The Beta Distribution

*(Figure: the Beta distribution, via Wikipedia)*

The Beta distribution takes two parameters, α and β, and takes values between 0 and 1.

This bounded region makes the Beta a helpful distribution when analyzing probabilities or proportions.

In fact, the Beta distribution is the “conjugate prior” of the Binomial distribution.

A conjugate prior is a term from Bayesian inference; recall that Bayes' theorem allows us to generate a posterior prediction by updating a prior distribution with data.

A conjugate prior means that the math behind that updating process works really nicely (to put it crudely): the posterior distribution is a parametric distribution that is easily updated.

Without a conjugate prior we need to use more advanced sampling methods to describe the posterior distribution.

This conjugate prior property affords an intuitive meaning to the α and β parameters.

Imagine a repeated Bernoulli trial with unknown probability of success — our goal is to estimate that unknown probability of success as we view repeated samples.

At first we may assume that all probabilities are equally plausible (though ideally we could give our model a head start by assuming some probabilities were more likely than others).

The Beta distribution describes our updated ("posterior") probability of success at each step, with the α parameter equal to the number of observed successes and the β parameter equal to the number of observed failures (adding 1 to each, since the parameters must be greater than 0).

In the simulation below, the blue line Beta(1,1) pdf is our starting point — it gives all probabilities equal weights.

The first trial is a success — this gives us the orange Beta(2,1).

The second and third trials are failures: green Beta(2,2) and red Beta(2,3).

Then a success — purple Beta(3,3), and finally a failure — Beta(3,4).
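The trial sequence walked through above can be sketched directly in code; `beta_update` below is my own illustrative helper (not from the article) applying the conjugate update rule:

```python
import numpy as np

def beta_update(alpha, beta, outcome):
    """Conjugate update: a success increments alpha, a failure increments beta."""
    return (alpha + 1, beta) if outcome else (alpha, beta + 1)

# Start from the uniform prior Beta(1, 1) and replay the sequence above:
# success, failure, failure, success, failure.
alpha, beta = 1, 1
for outcome in [True, False, False, True, False]:
    alpha, beta = beta_update(alpha, beta, outcome)
    print(f"posterior: Beta({alpha}, {beta})")

# Final posterior is Beta(3, 4): 2 successes and 3 failures on top of the prior.
```

The final print matches the last curve in the simulation, Beta(3, 4).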

You can pick up on the shape of the Beta from this: relatively higher α's move mass to the right, while relatively higher β's move mass to the left.

For α and β greater than 1 the pdf shifts probability to the middle, and for α and β less than 1 the pdf shifts probability to 0 and 1.

Less intuitively, but really cool: the Beta distribution describes the order statistics of a continuous uniform distribution on [0, 1].

Specifically, the kth smallest of a sample of size n is distributed as Beta(k, n + 1 − k).

The graph below shows the 5 order statistics of a 5-observation sample from the continuous uniform distribution: the minimum (blue), 25th percentile (orange), median (green), 75th percentile (red), and maximum (purple).
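A quick simulation can check one consequence of the Beta(k, n + 1 − k) claim: since that distribution has mean k/(n + 1), the median of 5 uniforms should average 3/6 = 0.5 (variable names here are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3                                # median of a sample of 5
samples = rng.uniform(size=(100_000, n))
kth = np.sort(samples, axis=1)[:, k - 1]   # k-th smallest in each row

# Beta(k, n + 1 - k) has mean k / (n + 1); for the median of 5 that is 0.5.
print(kth.mean())   # close to 0.5
```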

## The Dirichlet Distribution

*(Portrait: Johann Peter Gustav Lejeune Dirichlet)*

Named for the debonair 19th-century mathematician pictured above, the Dirichlet distribution is a multivariate generalization of the Beta distribution; in fact, its alternative name is the multivariate Beta distribution (MBD).

The Dirichlet takes a vector of parameters, one for each variate (there can be anywhere from 2 to infinitely many).

The output of the distribution is such that the sum of the variables always equals one — for example, in a 3-dimensional Dirichlet, x + y + z = 1.

Similar to the Beta distribution, setting all α's to 1 gives us a uniform distribution; here is Dir(1,1,1) as a 3D scatter plot. Setting the α's to less than one (0.1 here) pushes probability mass out to the edges of the distribution.

Another way of saying this is that the distribution favors one of the three variables being close to 1, at the cost of the other two.

α's greater than 1 (10 here) push probability mass to the center, such that the distribution favors equality among the 3 variables (they are all closer to 0.33).

Setting one α higher than the others skews probability mass in its direction.
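A minimal sketch of these α effects, using NumPy's built-in Dirichlet sampler (the α vectors mirror the examples above):

```python
import numpy as np

rng = np.random.default_rng(0)

for alpha in ([1, 1, 1], [0.1, 0.1, 0.1], [10, 10, 10], [1, 1, 10]):
    draws = rng.dirichlet(alpha, size=10_000)
    # Each row sums to 1; the mean of component i is alpha_i / sum(alpha),
    # so Dir(1, 1, 10) concentrates mass on the third variable.
    print(alpha, draws.mean(axis=0).round(2))
```

The spread around those means, not shown here, is what the scatter plots above visualize: small α's push draws to the corners, large α's pull them toward the center.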

Here is Dir(1,1,10):

The Dirichlet is the conjugate prior for the Categorical distribution (i.e. a discrete multivariate distribution which works like a multi-value Bernoulli: draw a uniform, and rather than a yes/no success, find the variate that corresponds to the uniform draw) and for the Multinomial distribution (a multivariate Binomial distribution).
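As a sketch of what that conjugacy buys: updating a Dirichlet prior with observed Categorical/Multinomial counts just adds the counts to the α vector. The counts below are hypothetical:

```python
import numpy as np

# Conjugate update for Categorical data: posterior concentration is
# the prior alpha plus the observed count in each category.
alpha_prior = np.array([1.0, 1.0, 1.0])
counts = np.array([4, 1, 0])            # hypothetical observed category counts
alpha_post = alpha_prior + counts

print(alpha_post)                       # [5. 2. 1.]
# Posterior mean probability of each category:
print(alpha_post / alpha_post.sum())    # 5/8, 2/8, 1/8
```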

Each variate in a Dirichlet is itself Beta distributed.

## Dirichlet Processes & the GEM Distribution

A Dirichlet process is a special form of the Dirichlet distribution.

A common motivating example illustrates the Dirichlet distribution as a "stick-breaking" process: recall that the sum of the variates is always 1.0, so each Beta-distributed variate "breaks off" a part of the 1.0 stick.

In the illustration above we draw from a Dir(1,1,1,1,1,1,1) — 7 variates.

Note that each variate follows a Beta distribution — here Beta(1, 6).
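The stick-breaking draw can be sketched as below; `stick_break_dirichlet` is my own illustrative helper, in which each break is a Beta draw of the current piece against the sum of the remaining α's:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_break_dirichlet(alphas, rng):
    """Draw from a Dirichlet by breaking a unit stick one Beta draw at a time."""
    alphas = np.asarray(alphas, dtype=float)
    remaining, pieces = 1.0, []
    for i in range(len(alphas) - 1):
        # Fraction of the remaining stick: Beta(alpha_i, sum of later alphas).
        frac = rng.beta(alphas[i], alphas[i + 1:].sum())
        pieces.append(remaining * frac)
        remaining -= pieces[-1]
    pieces.append(remaining)             # the last variate takes what is left
    return np.array(pieces)

x = stick_break_dirichlet([1] * 7, rng)
print(x, x.sum())   # seven pieces that rebuild the full stick
```

For Dir(1,1,1,1,1,1,1) the first break is Beta(1, 6), matching the marginal noted above.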

A Dirichlet process is a Dirichlet distribution with an infinite number of variates.

How can you parameterize a model that has an infinite number of variates, you may ask? We can think of the Dirichlet as a recursive process via the stick-breaking illustration. I may not need to know the full range of the Dirichlet for my analysis, but only which variate an observation of 0.4 would fall into: using the bottom line in the graph above, I would only need the blue, red, green, and purple draws to get to 0.4; the light blue, orange, and purple draws are not needed.

Having an infinite number of parameters allows your model to continue learning in an online fashion.

A common application is to clustering analysis: under a k-means clustering algorithm, the number of clusters needs to be defined ahead of time.

This works if the dataset is known and k can be tuned using the elbow method or other criteria, but in an online-learning application we may not be able to reliably tune k as each data point arrives.

The Dirichlet process allows us to place new data points into new clusters dynamically as the data comes in.

Using the stick-breaking example, a green "cluster" only needs to be added once an observation above ~0.25 is observed, purple only after ~0.35 is observed, etc.
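A rough sketch of this grow-as-needed behavior (my own illustrative code, using Beta(1, 1) breaks to match the Dir(1,…,1) example; a general Dirichlet process would break with Beta(1, α) draws):

```python
import numpy as np

rng = np.random.default_rng(0)

boundaries = []      # cumulative right edges of the stick pieces broken so far
remaining = 1.0

def cluster_of(u):
    """Index of the stick piece containing u, breaking new pieces only on demand."""
    global remaining
    while not boundaries or u > boundaries[-1]:
        # Lazily break off a new piece (i.e. open a new cluster).
        piece = remaining * rng.beta(1.0, 1.0)
        boundaries.append((boundaries[-1] if boundaries else 0.0) + piece)
        remaining -= piece
    return next(i for i, b in enumerate(boundaries) if u <= b)

# Incoming observations create clusters only when they fall past the stick so far.
for u in rng.uniform(size=5):
    print(round(float(u), 2), "-> cluster", cluster_of(u))
```

The number of pieces grows with the data rather than being fixed in advance, which is exactly the property that k-means lacks.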

The GEM Distribution is a special case of the Dirichlet process.

Named for Griffiths, Engen, and McCloskey’s early work in this space, the GEM distribution is a Dirichlet process which takes one parameter — we could write the above Dir(1,1,1,1,1,1,1) as GEM(1).

There are statistical implications of this characterization, but practically speaking it is helpful to limit the distribution to a single parameter for simplicity when no variation among the parameters is known.
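A GEM(α) draw breaks the stick with a fresh Beta(1, α) fraction at every step; since the infinite sequence of pieces never quite exhausts the stick, the sketch below (helper name is my own) truncates after a fixed number of pieces:

```python
import numpy as np

rng = np.random.default_rng(0)

def gem_weights(concentration, n_pieces, rng):
    """First n_pieces weights of a GEM(concentration) stick-breaking draw."""
    weights, remaining = [], 1.0
    for _ in range(n_pieces):
        frac = rng.beta(1.0, concentration)   # every break uses the same Beta(1, a)
        weights.append(remaining * frac)
        remaining -= weights[-1]
    return np.array(weights)

w = gem_weights(1.0, 10, rng)
print(w.round(3), "leftover:", round(1 - w.sum(), 3))
```

Larger concentration values make each break smaller, spreading mass over more pieces (more clusters); small values concentrate mass in the first few pieces.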

## Conclusion

This article was theory-heavy: we covered helpful properties of the Beta, Dirichlet, and GEM distributions as they relate to Bayesian analysis, building up to non-parametric Bayes.

In future articles I will use these models in application-heavy examples.