Even with increasingly better computational tools, such as MCMC, models based on conjugate distributions remain advantageous.

## Beta-Binomial

One of the better-known examples of conjugate distributions is the Beta-Binomial distribution, which is often used to model series of coin flips (the ever-present topic in posts about probability).

While the Binomial distribution represents the probability of success in a series of Bernoulli trials, the Beta distribution here represents the prior probability distribution of the probability of success for each trial. Thus, the probability p of a coin landing on heads is modeled as Beta-distributed (with parameters α and β), while the likelihood of heads and tails is assumed to follow a Binomial distribution with parameters n (the number of flips) and the Beta-distributed p, thus creating the link:

p ∼ Beta(α, β)
y ∼ Binomial(n, p)

## Gamma-Poisson

Another often-used conjugate distribution is the Gamma-Poisson distribution, so named because the rate parameter λ that parameterizes the Poisson distribution is modeled as a Gamma distribution:

λ ∼ Gamma(k, θ)
y ∼ Poisson(λ)

While the discrete Poisson distribution is often used in applications with count data, such as store customers, eCommerce orders, or website visits, the Gamma distribution is a useful model for the rate at which these events occur (λ), since the Gamma distribution takes on positive continuous values only, but is otherwise quite flexible in its parameterization:

[Figure: Gamma Distributions]

This Gamma-Poisson mixture is also known as the Negative-Binomial distribution, which we can think of as a mixture of Poisson distributions. If you find this confusing, you're not alone, and maybe you'll start to appreciate why we so often try to approximate things using the good old Normal distribution…

## Dirichlet-Multinomial

A perhaps even more interesting yet seemingly less talked-about example of conjugate distributions is the Dirichlet-Multinomial distribution, introduced in chapter 3 of
BDA3.

One way to think about the Dirichlet-Multinomial distribution is that while the Multinomial distribution (i.e. multiple choices) is a generalization of the Binomial distribution (i.e. a binary choice), the Dirichlet distribution is a generalization of the Beta distribution. That is, while the Beta distribution models the probability of a single probability p, the Dirichlet models the probabilities of multiple, mutually exclusive choices, parameterized by a, which is referred to as the concentration parameter and represents the weights for each choice (we'll see more on that later). In other words, think of coins for the Beta-Binomial distribution and dice for the Dirichlet-Multinomial distribution:

θ ∼ Dirichlet(a)
y ∼ Multinomial(n, θ)

In the wild, these days we might encounter the Dirichlet distribution most often in the context of topic modeling in natural language processing, where it's commonly used as part of a Latent Dirichlet Allocation (LDA) model, which is a fancy way of saying we're trying to figure out the probability of an article belonging to a certain topic given its content.

However, for our purposes, let's look at the Dirichlet-Multinomial in the context of simple multiple choices, and let's start by throwing dice as a motivating example.

## Throwing Dice

(If you want to try out the code snippets here, you'll need to import the relevant Python libraries first.
Or you can follow along with the Jupyter notebook accompanying this article.)

```python
import numpy as np
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pymc3 as pm
```

Let's first create some data representing 122 rolls of a six-sided die, where p represents the expected probability for each side of a fair die, i.e. 1/6.

```python
y = np.asarray([20, 21, 17, 19, 17, 28])
k = len(y)
p = 1/k
n = y.sum()
print(n, p)
```

    (122, 0.16666666666666666)

Just looking at a simple bar plot of the data, we suspect that we might not be dealing with a fair die!

```python
sns.barplot(x=np.arange(1, k+1), y=y);
```

[Figure: Barplot of rolls of a six-sided die]

However, students of Bayesian statistics that we are, we'd like to go further, quantify our uncertainty in the fairness of the die, and calculate the probability that someone slipped us a loaded die.

Let's set up a simple model in PyMC3 that not only calculates the posterior probability for theta (i.e. the probability for each side of the die), but also estimates the die's bias for returning a 6. We will use a PyMC3 Deterministic variable for that purpose, in addition to our unobserved (theta) and observed (results) random variables.

For the prior on theta, we'll assume a non-informative Uniform distribution, by initializing the Dirichlet prior with a series of 1s for the parameter a, one for each of the k possible outcomes. This is similar to initializing a Beta distribution as Beta(1,1), which corresponds to the Uniform distribution (more on this here).

```python
with pm.Model() as dice_model:

    # initializes the Dirichlet distribution with a uniform prior:
    a = np.ones(k)

    theta = pm.Dirichlet("theta", a=a)

    # Since theta[5] will hold the posterior probability
    # of rolling a 6, we'll compare it to the
    # reference value p = 1/6 to determine the amount of bias
    # in the die
    six_bias = pm.Deterministic("six_bias", theta[k-1] - p)

    results = pm.Multinomial("results", n=n, p=theta, observed=y)
```
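As a quick aside (a numpy-only sketch, not part of the PyMC3 model): because the Dirichlet prior is conjugate to the Multinomial likelihood, the exact posterior for this model is available in closed form as Dirichlet(a + y), which we can sample directly as a sanity check on the MCMC results.

```python
import numpy as np

# Conjugacy sanity check: with a Dirichlet(1, ..., 1) prior and observed
# Multinomial counts y, the exact posterior is Dirichlet(1 + y),
# so no MCMC is needed to sample from it.
y = np.asarray([20, 21, 17, 19, 17, 28])
a_post = 1 + y  # posterior concentration parameters

rng = np.random.default_rng(123)
theta_samples = rng.dirichlet(a_post, size=50_000)

# P(theta_6 > 1/6): the probability that the die is biased towards 6
p_six_biased = (theta_samples[:, 5] > 1 / 6).mean()
print(f"P(Six is biased) = {p_six_biased:.2%}")
```

This should land close to the MCMC-based estimate (around 95%), since both approaches target the same posterior.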
models in plate notation:

```python
pm.model_to_graphviz(dice_model)
```

Let's draw 1,000 samples from the joint posterior using the default NUTS sampler:

```python
with dice_model:
    dice_trace = pm.sample(draws=1000)
```

    Auto-assigning NUTS sampler…
    Initializing NUTS using jitter+adapt_diag…
    Multiprocess sampling (4 chains in 4 jobs)
    NUTS: [theta]
    Sampling 4 chains: 100%|██████████| 6000/6000 [00:01<00:00, 3822.31draws/s]

From the traceplot, we can already see that one of the theta posteriors isn't in line with the rest:

```python
with dice_model:
    pm.traceplot(dice_trace, combined=True, lines={"theta": p})
```

We'll plot the posterior distributions for each theta and compare them to our reference value p to see if the 95% HPD (Highest Posterior Density) interval includes p = 1/6.

```python
axes = pm.plot_posterior(dice_trace, varnames=["theta"], ref_val=np.round(p, 3))
for i, ax in enumerate(axes):
    ax.set_title(f"{i+1}")
```

We can clearly see that the HPD for the posterior probability of rolling a 6 barely includes the value we'd expect from a fair die.

To be more precise, let's plot the probability of our die being biased towards 6, by comparing theta[Six] to p.

```python
ax = pm.plot_posterior(dice_trace, varnames=["six_bias"], ref_val=[0])
ax.set_title(f"P(Theta[Six] - {p:.2%})");
```

[Figure: P(Theta[Six])]

Lastly, we can calculate the probability that the die is biased towards 6 by calculating the density to the right of our reference line at 0:

```python
six_bias = dice_trace["six_bias"]
six_bias_perc = len(six_bias[six_bias > 0]) / len(six_bias)
print(f'P(Six is biased) = {six_bias_perc:.2%}')
```

    P(Six is biased) = 95.25%

Thus, there's a better than 95% chance that our die is biased towards 6. Better get some new dice!

## Polling #1

Let's turn our review of the Dirichlet-Multinomial distribution to another example, concerning polling data.

In section 3.4 of BDA3 on multivariate models, specifically the section on Multinomial Models for Categorical Data, the authors include a (somewhat dated) example of polling data from the 1988 Presidential race between George H.W. Bush and Michael
Dukakis.

Spoiler alert for those not following politics back then: Bush won by a huge margin. Since 1988, no candidate in a Presidential election has managed to equal or surpass Bush's share of the electoral or popular vote.

(Image credit: https://commons.wikimedia.org/wiki/File:ElectoralCollege1988-Large.png)

Anyway, back to the data problem! Here's the setup:

- 1,447 likely voters were surveyed about their preferences in the upcoming presidential election
- Their responses were:
  - Bush: 727
  - Dukakis: 583
  - Other: 137

What is the probability that more people will vote for Bush over Dukakis, i.e. what is the difference in support for the two major candidates?

We set up the data, where k represents the number of choices the respondents had:

```python
y = np.asarray([727, 583, 137])
n = y.sum()
k = len(y)
```

We, again, set up a simple Dirichlet-Multinomial model and include a Deterministic variable that calculates the value of interest: the difference in the share of respondents for Bush vs. Dukakis.

```python
with pm.Model() as polling_model:

    # initializes the Dirichlet distribution with a uniform prior:
    a = np.ones(k)

    theta = pm.Dirichlet("theta", a=a)

    bush_dukakis_diff = pm.Deterministic("bush_dukakis_diff", theta[0] - theta[1])

    likelihood = pm.Multinomial("likelihood", n=n, p=theta, observed=y)
```

```python
pm.model_to_graphviz(polling_model)
```

```python
with polling_model:
    polling_trace = pm.sample(draws=1000)
```

Looking at the % difference between respondents for Bush vs. Dukakis, we can see that most of the density is greater than 0%, signifying a strong advantage for Bush in this poll.

We've also fit a Beta distribution to this data via scipy.stats, and we can see that the posterior of the difference of the 2 theta values fits a Beta distribution very nicely (which is to be expected, given the properties of the Dirichlet distribution as a multivariate generalization of the Beta distribution).

```python
_, ax = plt.subplots(1, 1, figsize=(10, 6))
sns.distplot(polling_trace["bush_dukakis_diff"], bins=20, ax=ax, kde=False, fit=stats.beta)
ax.axvline(0, c='g', linestyle='dotted')
ax.set_title("% Difference Bush vs Dukakis")
ax.set_xlabel("% Difference");
```

Percentage of samples with bush_dukakis_diff > 0:

```python
diff = polling_trace["bush_dukakis_diff"]
bush_dukakis_diff_perc = len(diff[diff > 0]) / len(diff)
print(f'P(More Responses for Bush) = {bush_dukakis_diff_perc:.0%}')
```

    P(More Responses for Bush) = 100%

## Polling #2

As an extension to the previous model, the authors of BDA include an exercise in chapter 3.10 (Exercise 2) that presents us with polling data from the 1988 Presidential race, taken before and after one of the debates:

> Comparison of two multinomial observations: on September 25, 1988, the evening of a presidential campaign debate, ABC News conducted a survey of registered voters in the United States; 639 persons were polled before the debate, and 639 different persons were polled after. The results are displayed in Table 3.2. More details
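To sketch how this comparison could be set up using the conjugate form directly: the two surveys are independent, so each gets its own Dirichlet posterior, and the quantity of interest is the shift in Bush's share among respondents preferring a major candidate. Note that the counts below are hypothetical placeholders, not the actual Table 3.2 values; substitute the book's data to work the exercise.

```python
import numpy as np

# Sketch of the pre-/post-debate comparison. NOTE: these counts are
# hypothetical placeholders (each summing to 639 respondents), NOT the
# actual Table 3.2 data -- substitute the book's values for the exercise.
pre = np.array([320, 280, 39])    # Bush, Dukakis, other (placeholder)
post = np.array([300, 310, 29])   # Bush, Dukakis, other (placeholder)

# Independent surveys => independent Dirichlet(1 + counts) posteriors
rng = np.random.default_rng(42)
theta_pre = rng.dirichlet(1 + pre, size=50_000)
theta_post = rng.dirichlet(1 + post, size=50_000)

# alpha = Bush's share among respondents preferring a major candidate
alpha_pre = theta_pre[:, 0] / (theta_pre[:, 0] + theta_pre[:, 1])
alpha_post = theta_post[:, 0] / (theta_post[:, 0] + theta_post[:, 1])

# Posterior of the shift towards Bush between the two surveys
shift = alpha_post - alpha_pre
p_shift_to_bush = (shift > 0).mean()
print(f"P(shift towards Bush) = {p_shift_to_bush:.0%}")
```

The same structure could of course be written as a PyMC3 model with two Multinomial likelihoods and a Deterministic for the shift, mirroring the earlier polling model.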