Statistical LearningThe hybrid Monte Carlo Methodin PythonDiogo RibeiroBlockedUnblockFollowFollowingJun 2Photo by Hyouk Seo on UnsplashThe Hamiltonian Monte Carlo method is a kind of Metropolis-Hastings method.

One of the weak points of Monte Carlo sampling comes up with random walks.

Hamiltonian Monte Carlo method (HMC) is an approach to reducing the randomizing in the algorithm of the sampling.

Hamiltonian Monte Carlo has proven a remarkable empirical success, but only recently have we begun to develop a rigorous understanding of why it performs so well on difficult problems and how it is best applied in practice.

Unfortunately, that understanding is confined within the mathematics of differential geometry which has limited its dissemination, especially to the applied communities for which it is particularly important.

In this review I provide a comprehensive conceptual account of these theoretical foundations, focusing on developing a principled intuition behind the method and its optimal implementations rather of any exhaustive rigor.

Whether a practitioner or a statistician, the dedicated reader will acquire a solid grasp of how Hamiltonian Monte Carlo works, when it succeeds, and, perhaps most importantly, when it fails.

The original name was the hybrid Monte Carlo method.

I prefer calling it the Hamiltonian Monte Carlo method.

Hamiltonian mechanics is one of the coolest topics you can learn in mechanics class.

I would not have introduced about the real amazing part of the Hamiltonian mechanics, but I still strongly recommend you to look into it from other references.

Graduate level of mechanics books must have the part.

We will approach the idea from the Metropolis-Hastings method with very narrow target density.

Metropolis-Hastings method reviewIn Metropolis-Hastings sampling, we use a proposal density Q(x,x’), which is a probability to transit from x to x’.

x’ is the probable next position.

If the proposal distribution is symmetric under the switch of position vectors, the proposal distribution becomes irrelevant at the procedures.

What matters is the relative proportion of P^* (x).

If P^* (x’) / P^* (x) is bigger than 1, just draw the x’ to the sample.

If x = x^(n), we set x’ = x^(n+1).

If P^* (x’) / P^* (x)$ is smaller than 1, draw the x’ to the sample with the probability, P^* (x’) / P^* (x).

Intuitively, if we keep doing this procedure, we will obtain the sample of P(x) up to some proportional constant.

import numpy as npimport randomimport matplotlib.

pyplot as plt%matplotlib inlineplt.

style.

use('ggplot')def P_star(X1,X2): return np.

exp(-(250.

25*X1*X1-2*249.

75*X1*X2+250.

25*X2*X2))# initial pointx0, y0 = np.

array([-1,1])Listx , Listy = [x0], [y0]# number of sampleN = 100for i in range(N): xn, yn = random.

uniform(-1,1),random.

uniform(-1,1) a = P_star(xn,yn)/P_star(x0,y0) if a >=1: Listx.

append(xn) Listy.

append(yn) x0,y0 = xn, yn else: if random.

random()<a: Listx.

append(xn) Listy.

append(yn) x0,y0 = xn, ynYou will see we define the proportion of the target densities.

a = P_star(xn,yn)/P_star(x0,y0)And then accept the above conditions.

I did not make a script for rejection respectively.

Not needed.

# the rate of the samplelen(Listx)/100.

0plt.

plot(Listx, Listy,'o')plt.

figure()plt.

plot(Listx, Listy,'o-', alpha=0.

4)x = np.

arange(-1, 1, 0.

01)y = np.

arange(-1, 1, 0.

01)X, Y = np.

meshgrid(x, y)plt.

contour(X,Y,P_star(X,Y))When the target density is narrow-shaped like the figure, the random samples get out of the zone more probably, and the Metropolis-Hastings sampler performs very poorly.

In the above example, only 14 % of samples are selected.

Hamiltonian Monte CarloIn statistical physics, the probability distribution of particles isEarlier, I wrote that the Hamiltonian Monte Carlo method reduces the randomness.

How do we diminish it?.We suggest the direction the sample move toward, by the leapfrog algorithm.

The leapfrog algorithm chooses the direction by learning the gradient of the energy.

1In the above 2-dimensional contour plot, the center of the ellipse is more probable.

It is very much similar to a map of a mountain.

Imagine that you stand at the edge of the mountain, and want to conquer the top of the mountain as soon as possible.

The sample run by leapfrog algorithm is in a similar situation.

It has the conserved dynamical energy for each turn, kinetic energy plus potential energy.

It feels the slope of the mountain and tends to go along where the slope is steeper.

def P_Hamiltonian(X1,X2,p1,p2): return np.

exp(-(250.

25*X1*X1-2*249.

75*X1*X2+250.

25*X2*X2)-p1*p1/2.

0-p2*p2/2.

0)def gradE(x): return np.

array([500.

50*x[0]-2*249.

75*x[1],-2*249.

75*x[0]+500.

50*x[1]])# initial pointx0 = np.

array([-1,1])Listx = [x0]# number of samplesN = 100T = 100epsilon = 0.

056for i in range(N): p0 = np.

random.

normal(0,1,2) for t in range(T): ph = p0 – epsilon* gradE(x0)/2.

0 xn = x0 + epsilon* ph pn = ph – epsilon* gradE(xn)/2 a = P_Hamiltonian(xn[0],xn[1],pn[0],pn[1])/P_Hamiltonian(x0[0],x0[1],p0[0],p0[1]) if a >=1: Listx.

append(xn) x0,p0 = xn, pn else: if random.

random() < a: Listx.

append(xn) x0,p0 = xn, pnfor t in range(T): ph = p0 – epsilon* gradE(x0)/2.

0 xn = x0 + epsilon* ph pn = ph – epsilon* gradE(xn)/2This for loop is where the leapfrog algorithm works, and it is nothing but a finite displacement of the momentum and positions in the continuous phase space consists of the position and the momentum vectors, (x, p) under the rule of the dynamics.

To understand how samples move in the dynamic system, we solve the equation of motion, Hamiltonian equation.

# the rate of samplelen(Listx)/100.

00.

61Now 61 % of samples are accepted.

x11=[Listx[i][0] for i in range(len(Listx))]x22=[Listx[i][1] for i in range(len(Listx))]plt.

figure()plt.

plot(x11,x22,'-',alpha=0.

4)plt.

scatter(x11,x22,c='red', s=30)x1 = np.

arange(-1, 1, 0.

01)x2 = np.

arange(-1, 1, 0.

01)X1, X2 = np.

meshgrid(x1, x2)plt.

contour(X1,X2,np.

exp(-(250.

25*X1*X1-2*249.

75*X1*X2+250.

25*X2*X2)))See how the Markov chain is sailing, started at (-1, ~1).

It goes toward the center quickly but passed the center because it was too fast, and gradually converges inside the target distribution density in a few turns.

The size (or length) of steps are controlled by the parameter, epsilon, and T.

If epsilon is too big, the differentiation won’t work approximately.

If it is too tiny, it will take too much time to make a convergent Markov chain over the target density.

In other words, the chain of the sample will move too slowly by tiny steps.

More exactly, it is potential energy.

The initial momentum to run the leapfrog algorithm is given randomly by the normal distribution N(0,1) in both dimensions, in the example.

For a more intuitive explanation, I gave an example of a mountain.

Nevertheless, a pit in the ground could be a better comparison, and then roll the ball in it.

The ball would not dig in the deepest hole at once.

It would roll back a few times and gradually converges to the hole.

Don’t get confused the total dynamical energy, potential plus kinetic energy of the sample is not conserved.

For each turn, the initial momentum is given randomly by the normal distribution N(0,1) in both dimensions, in the example.

.