An Introduction to Bayesian Inference and Markov Chain Monte Carlo

Neel Parekh · Jun 18

Foreword: The following post is intended to be an introduction with some math.
Although it includes math, I’m by no means an expert.
I struggled to learn the basics of probabilistic programming, and hopefully this helps someone else make a little more sense out of the world.
If you find any errors, please drop a comment and help us learn.
Cheers!

In data science, we are often interested in understanding how data was generated, mainly because it allows us to answer questions about new and/or incomplete observations.
Given an (often noisy) dataset, though, there are an infinite number of possible model structures that could have generated the observed data.
But not all model structures are made equally — that is, certain model structures are more realistic than others based on our assumptions of the world.
It is up to the modeler to choose the one that best describes the world they are observing.
GOAL: Find the model M that best explains how a dataset D was generated.

For example, we might believe that a model that predicts the probability of me eating gelato would be appropriate if we were to set it up so that the date and my location influence the temperature, which in turn influences the chances of me wolfing down dark chocolate gelato (the best kind).
But this only gives us the model structure.
We also need to know how much each input influences each level of the model — these are the model parameters θ.
I love gelato.
Even after we have settled on our assumptions about the world, the model structure itself is still abstract.
We would love to have a function that can assign probabilities to any possible gelato outcome for a given set of inputs, but with just an abstract model choice, we don’t yet know exactly which values to use for the parameters in the model we’ve chosen.
This is where the probability theory comes in.
REVISED GOAL: For a given model structure M that encompasses our assumptions about the structure of the world (i.e. how we believe the data were generated), what are the values of the model's parameters θ that best explain the observed data D, if it had been generated under this chosen model?

In other words, we might be interested in finding the values of θ that maximize the likelihood of having observed the ys we did given the Xs we did (i.e. the likelihood of the observed data D).
This is called maximum likelihood estimation.
Back to the gelato: imagine you recorded every single day where I was and whether I ate gelato or not (weird, but maybe my doctor is concerned I’m eating too much gelato).
The maximum likelihood estimator would output the θs that maximize the likelihood function.
The likelihood is the probability that we would have observed the ys (my gelato consumption) and Xs (the date and my location) if those ys were actually generated with those θs and those Xs.
Maximizing the likelihood gives us the combination of θs that seemingly worked the best at explaining the data.
In principle, we could search through every single combination of values for the model parameters θ and calculate the probability that the model with that specific set of parameter values generated our observed data D.
Of course, it’s impossible to search through EVERY combination of θ in most practical scenarios, so in traditional machine learning we instead optimize this search with algorithms like gradient descent, which choose which values of θ to test next given how well this current combination of values did at explaining the observed data.
We then select and report the combination of θs that maximized the likelihood of our data.
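As a concrete sketch of this search, here is brute-force maximum-likelihood estimation for a hypothetical Bernoulli "did I eat gelato today?" model with a single parameter θ. The data, the grid resolution, and the function names are all made up for illustration; a grid search stands in for gradient descent, which only works because there is one parameter:

```python
import math

def log_likelihood(theta, data):
    # Log-likelihood of Bernoulli observations (1 = ate gelato, 0 = didn't)
    return sum(math.log(theta) if y == 1 else math.log(1 - theta) for y in data)

# Toy record: gelato eaten on 7 of 10 observed days (hypothetical data)
data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]

# Evaluate the likelihood on a fine grid of candidate theta values and
# report the one that maximizes it -- a stand-in for gradient descent
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=lambda t: log_likelihood(t, data))
print(theta_mle)  # 0.7, the sample frequency, as expected for a Bernoulli MLE
```

Note that the output is a single point estimate: one number, with no sense of how confident we should be in it, which is exactly the issue discussed next.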
One issue with this approach is that we are reporting a single combination of θs as our best estimate, but someone else who sees this report will have no idea how confident we are for each reported parameter value.
This is an issue because observed data is a sample of the true population and inherently noisy, and no matter how much data we collect, we should never be 100% confident in any point estimate of a model parameter.
Consider if we had observed a different or limited sample of the population (e.g. what if I only remembered to record my gelato consumption 75% of the time, or only when I was in NYC?): would our reported θs change drastically or minimally? Knowing the credible interval for each θ value embeds information about how representative we think our estimate is for a given model parameter based on the data we've seen, which may often represent the effect of a true physical phenomenon.
We also know very little about how knowledge of my gelato consumption might affect the probability of it having been a certain temperature or of me being in a certain city (i.e. inference in the reverse direction).
Bayesian Inference

In order to address this issue, I'd like to introduce Bayesian inference, where Bayes' Rule is our mantra:

Eqn 1 (Bayes' Rule): P(θ | D) = P(D | θ) · P(θ) / P(D)

Let's see how it can be applied to our goal.
It would be ideal to know the probability distribution of a parameter value θ under the model structure we've chosen, given the data we've observed: P_M(θ | D). Plugging this into Bayes' Rule:

Eqn 2: P_M(θ | D) = P_M(D | θ) · P(θ) / P_M(D)

Eqn 3: P_M(D) = Σ_θ P_M(D | θ) · P(θ)

We can now rewrite Eqn 2 as:

Eqn 4: P_M(θ | D) = P_M(D | θ) · P(θ) / Σ_θ' P_M(D | θ') · P(θ')

If we choose values of θ to test for our pre-selected model, it's really easy to calculate P_M(D | θ) because we know all the inputs/outputs/parameters of this modeling function (recall this is called the likelihood), and we also know the prior P(θ) because it represents our beliefs about a specific θ (e.g. for a seemingly fair coin, θ is likely a Normal centered on 0.5, and depending on our beliefs of how fair the coin is, we might increase or decrease the variance of that Normal).
The numerator of Eqn 4 is quite simple to calculate for a chosen θ. Ask for those quantities for every single possible θ, though, and we quickly realize we run into the same problem as before in traditional machine learning: we can't evaluate every single θ! In other words, the denominator of Eqn 4 is nearly impossible to compute.
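To make the asymmetry concrete, here is a toy sketch for the same hypothetical Bernoulli gelato model, with an assumed Normal-shaped prior centered on 0.5 (both are illustrative choices, not from the post). The numerator of Eqn 4 is one cheap evaluation; the denominator needs a sum over the whole parameter grid, which is only feasible here because θ is one-dimensional:

```python
import math

def likelihood(theta, data):
    # P(D | theta) for Bernoulli observations
    p = 1.0
    for y in data:
        p *= theta if y == 1 else (1 - theta)
    return p

def prior(theta):
    # Assumed prior belief: Normal-shaped bump centered on 0.5 (unnormalized)
    return math.exp(-0.5 * ((theta - 0.5) / 0.15) ** 2)

data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]  # hypothetical gelato record

# Numerator of Eqn 4 for ONE candidate theta: easy, two function calls
theta = 0.6
numerator = likelihood(theta, data) * prior(theta)

# Denominator of Eqn 4: requires summing over ALL theta. In 1-D we can
# approximate it on a fine grid, but with many parameters the number of
# grid points explodes exponentially and this becomes intractable.
grid = [i / 1000 for i in range(1, 1000)]
denominator = sum(likelihood(t, data) * prior(t) for t in grid) / len(grid)

posterior_at_theta = numerator / denominator
```

The key observation that MCMC exploits, described next, is that the expensive denominator is the same for every θ.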
MCMC

Markov Chain Monte Carlo (MCMC) techniques were developed in order to intelligently sample θs rather than directly sum the likelihood and prior for every possible θ.
The main idea behind these techniques is similar to the θ update techniques we saw in traditional machine learning: we “update” with a new set of θ values based on our evaluation of the likelihood of the current set of θ values.
The big difference here is that we are sampling rather than updating — in other words, we are interested in the history of values we have explored.
This is important because we are no longer simply interested in knowing the single best estimate of each of the parameters (which is what we do in gradient descent), but rather we are interested in knowing the collection of "good" estimates for each of the parameters and how likely they are (i.e. the probability distribution of θ given D).
Let’s delve into a high level overview of MCMC algorithms and why they work here.
MCMC cares about tracking two things:

θ_current: a single θ we are currently interested in
trace_θ: a list of all θ_currents in order

MCMC begins by choosing a random initial value of θ:

θ_current = θ_0
trace_θ = [θ_0]

and calculates the likelihood and prior (i.e. the numerator of Eqn 4).
Recall the motivation for MCMC: although the denominator is constant across all choices of θ, we can't calculate the sum in the denominator directly, so let's put it aside for now.
At this point, we have no idea whether our chosen θ_0 is a good choice.
We instead choose a proposal θ_1 (at this point, consider it magically chosen, but we'll get back to that really soon), and if it produces a larger numerator in Eqn 4, we can agree it's a better choice for θ. This is because the denominator of Eqn 4 is constant, no matter which θ we plug in.
Since we know θ_1 is better, let's add it to our trace, update our current θ, and calculate the likelihood and prior (i.e. the numerator of Eqn 4):

θ_current = θ_1
trace_θ = [θ_0, θ_1]

We can now choose a new θ_2 to propose (still a magical process, for now), and if it's not as good at explaining the data (i.e. it produces a smaller numerator in Eqn 4), then we don't immediately accept it this time; instead we accept it into our θ_current and trace_θ with probability α:

Eqn 5: α = [P_M(D | θ_2) · P(θ_2)] / [P_M(D | θ_current) · P(θ_current)]

The reason for this is that θ_2 is only α times as good as θ_current. This means that in our sampling history, we should expect the number of times we sample θ_2 to be a factor α of the number of times we sample θ_current:

Eqn 6: count(θ_2) ≈ α · count(θ_current)

In this regime, after sufficient sampling, you should have an intuition that we will have explored a lot of θ values and saved them in the trace variable with frequency proportional to how well each θ value explains the data relative to other θs.
That's exactly the distribution we were interested in!

And lastly, we return to the question of how to generate good proposal θs.
If we have no method and choose θs to propose at random among all possible values of θ, we will reject far too many samples (considering the true density of θ is probably relatively narrowly distributed).
Instead we draw θs from a proposal distribution q(θ_proposed | θ_current).
For example we might choose to make q a Normal with a fixed variance and mean θ_current.
This is the essence of one of the most popular MCMC algorithms called Metropolis-Hastings (MH).
We incorporate this into our acceptance ratio α in the following way:

Eqn 7: α_r = α · [q(θ_current | θ_proposed) / q(θ_proposed | θ_current)]

If we choose a symmetric proposal distribution (e.g. the Normal distribution), then:

Eqn 8: q(θ_proposed | θ_current) = q(θ_current | θ_proposed)

and it follows that α_r = α.
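Putting the pieces together, here is a minimal sketch of a random-walk sampler with a symmetric Normal proposal, applied to the same toy Bernoulli gelato model with an assumed Normal-shaped prior. The data, prior, step size, sample count, and burn-in length are all illustrative assumptions, not a production sampler:

```python
import math
import random

def log_numerator(theta, data):
    # log of the numerator of Eqn 4: log P(D | theta) + log P(theta),
    # with a Bernoulli likelihood and an assumed Normal(0.5, 0.15) prior
    if not (0 < theta < 1):
        return -math.inf  # outside the support: reject with certainty
    ll = sum(math.log(theta) if y == 1 else math.log(1 - theta) for y in data)
    lp = -0.5 * ((theta - 0.5) / 0.15) ** 2
    return ll + lp

def metropolis_hastings(data, n_samples=20000, step=0.1, seed=0):
    rng = random.Random(seed)
    theta_current = 0.5          # theta_0, the random-ish initial value
    trace = [theta_current]
    for _ in range(n_samples):
        # Symmetric proposal q(theta_proposed | theta_current)
        theta_proposed = rng.gauss(theta_current, step)
        # Acceptance ratio alpha in log space; the unknown denominator
        # of Eqn 4 is the same for both thetas, so it cancels out here
        log_alpha = log_numerator(theta_proposed, data) - log_numerator(theta_current, data)
        if math.log(rng.random()) < log_alpha:
            theta_current = theta_proposed   # accept the proposal
        trace.append(theta_current)          # record even on rejection,
        # so each theta's frequency in the trace tracks how good it is
    return trace

data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]        # hypothetical gelato record
trace = metropolis_hastings(data)
burned = trace[2000:]                        # discard a burn-in period
posterior_mean = sum(burned) / len(burned)
```

Note how the posterior mean lands between the prior's center (0.5) and the maximum-likelihood value (0.7): the prior pulls the estimate toward our prior belief, with strength depending on how much data we have.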
This is called the Random Walk MH algorithm, and you'll end up with plots like this:

[Image 1: posterior KDE and sampling trace]

The left is a KDE plot (essentially a smoothed histogram) for an "intercept_mu" parameter, and the right is the trace over time for each sampling chain.
We would be able to infer that although the most likely value of the parameter (the mode of the trace, a.k.a. the MAP) is 0.16, the credible interval of the parameter is likely between 0.1 and 0.
Appendix

For an easy visual explanation of this process, check out this video.
Most packages that help you do MCMC will generally run 3 or more traces (a.k.a. "chains") with different random initializations to ensure the chains are "converging".
This is important because converging chains indicate your Markov Chain has stabilized.
I generally factor in a burn-in period, which discards the first few thousand samples that may depend on the random initialization of θ.

The choice of prior has influence on the outcome of the sampling.
Don't choose priors that are far too narrow (your bias will prevent proper exploration of the parameter space and your chains will fail to converge) or far too wide (called an uninformative prior; your chances of converging also decrease, as you'll spend too much time rejecting useless samples).
I usually use the mode or mean of the trace to report a most likely value, and the HPD (Highest Posterior Density) Interval in order to establish the credible interval.
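As a sketch of that reporting step, here is one simple way to compute an HPD interval from a trace: slide a window over the sorted samples and keep the shortest one containing the desired mass. The synthetic Normal samples stand in for a real trace, and the 94% mass is an arbitrary illustrative choice:

```python
import math
import random

def hpd_interval(samples, mass=0.94):
    # Shortest interval containing `mass` of the posterior samples
    s = sorted(samples)
    n = len(s)
    k = int(math.ceil(mass * n))  # number of samples the window must hold
    # Slide a window of k consecutive sorted samples; keep the narrowest
    width, i = min((s[i + k - 1] - s[i], i) for i in range(n - k + 1))
    return s[i], s[i + k - 1]

# Synthetic stand-in for a real MCMC trace of "intercept_mu"
rng = random.Random(1)
samples = [rng.gauss(0.16, 0.03) for _ in range(5000)]

lo, hi = hpd_interval(samples)
trace_mean = sum(samples) / len(samples)
```

For a unimodal, roughly symmetric posterior like this one, the HPD interval nearly coincides with the central 94% interval; the two diverge when the posterior is skewed or multimodal, which is when HPD is the more honest summary.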