If you’re familiar with calculus, you’ll know that you can find the maximum of a function by taking its derivative and setting it equal to 0.

The derivative of a function represents the rate of change of the original function.

If you look at the log-likelihood curve above, you’ll see that it initially changes in the positive direction (moving up).

It reaches a peak, and then it starts changing in a negative direction (moving down).

The key is that at the peak, the rate of change is 0.

So if we know the functional form of the derivative, we can set it equal to 0 and solve for the best parameters.
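As a quick illustration of this idea, here is a minimal sketch using a made-up function with a single peak. The function `f` and its hand-derived derivative `f_prime` are assumptions chosen only for illustration; they are not the log-likelihood from this post.

```python
# Hypothetical function with a single peak at x = 2 (illustration only).
def f(x):
    return -(x - 2.0) ** 2 + 3.0

# Its derivative, worked out by hand: f'(x) = 2 * (2 - x).
def f_prime(x):
    return 2.0 * (2.0 - x)

# Left of the peak the slope is positive, at the peak it is zero,
# and right of the peak it is negative.
for x in (1.0, 2.0, 3.0):
    print(f"x = {x}: f(x) = {f(x)}, f'(x) = {f_prime(x)}")

# Setting f'(x) = 0 gives 2 * (2 - x) = 0, so the peak is at x = 2.
peak_x = 2.0
```

The same recipe (differentiate, set to 0, solve) is what gets applied to the coin flip log-likelihood.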

Coin Flip MLE

Let’s derive the MLE estimator for our coin flip model from before.

I’ll cover the MLE estimator for our linear model in a later post on linear regression.

Recall that we’re modeling the outcome of a coin flip by a Bernoulli distribution, where the parameter p represents the probability of getting a heads.

First, let’s write down the likelihood function for a single flip:

L(p) = p^x ⋅ (1-p)^(1-x)

(Likelihood function for the Bernoulli distribution.)

I’ve written the probability mass function of the Bernoulli distribution in a mathematically convenient way.

Take a second to verify for yourself that when x=1 (heads), the probability is p, and when x=0 (tails), the probability is (1-p).
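If you’d rather check this with code, here is a small sketch. It assumes the usual compact form of the Bernoulli probability mass function, p^x ⋅ (1-p)^(1-x), with heads encoded as 1 and tails as 0. The function name `bernoulli_likelihood` and the value p = 0.25 are arbitrary choices for illustration.

```python
def bernoulli_likelihood(x, p):
    """Likelihood of one flip: x = 1 for heads, x = 0 for tails."""
    return p ** x * (1 - p) ** (1 - x)

p = 0.25  # arbitrary value, chosen only for illustration

heads_likelihood = bernoulli_likelihood(1, p)  # the (1 - p) factor has exponent 0
tails_likelihood = bernoulli_likelihood(0, p)  # the p factor has exponent 0

print("heads:", heads_likelihood)
print("tails:", tails_likelihood)
```

The exponent trick simply switches off whichever factor doesn’t apply to the observed outcome.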

Now, let’s assume we see the following sequence of flips:

X = heads, heads, tails, heads, tails, tails, tails, heads, tails, tails.

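Since the coin flips are iid, we can write the likelihood of seeing a particular sequence as the product of the likelihoods of the individual flips:

L(p) = ∏ p^(x_i) ⋅ (1-p)^(1-x_i), with the product taken over the flips i = 1, …, n

(Likelihood of a sequence of flips.)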

Plugging in our data, we get:

L(p) = p ⋅ p ⋅ (1-p) ⋅ p ⋅ (1-p) ⋅ (1-p) ⋅ (1-p) ⋅ p ⋅ (1-p) ⋅ (1-p).

Notice that for every heads we get a factor of p, and for every tails a factor of (1-p).

Let’s generalize this to n coin flips with h heads:

L(p) = p^h ⋅ (1-p)^(n-h)

(Likelihood for n coin flips with h heads.)
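To see that the flip-by-flip product and the compact form agree, here is a rough sketch using the ten flips from above, encoded as 1 for heads and 0 for tails. It assumes the compact form is p^h ⋅ (1-p)^(n-h); the function names and the trial values of p are made up for illustration.

```python
from math import isclose, prod

# The observed sequence from the text: 1 = heads, 0 = tails.
flips = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

def likelihood_product(p, flips):
    """Multiply the per-flip likelihoods p^x * (1-p)^(1-x)."""
    return prod(p ** x * (1 - p) ** (1 - x) for x in flips)

def likelihood_counts(p, n, h):
    """Compact form for n flips with h heads: p^h * (1-p)^(n-h)."""
    return p ** h * (1 - p) ** (n - h)

n, h = len(flips), sum(flips)  # 10 flips, 4 of them heads

for p in (0.2, 0.4, 0.6):
    by_product = likelihood_product(p, flips)
    by_counts = likelihood_counts(p, n, h)
    print(f"p = {p}: product = {by_product:.6e}, compact form = {by_counts:.6e}")
```

Only the counts n and h matter here; the order of the flips doesn’t change the likelihood.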

We want to find the p that maximizes this function.

To make our job easier, let’s take the log of both sides.

This will bring the exponents down and turn the product into a sum.

Taking the derivative of a sum is easier than taking the derivative of a product (another convenience of the log-likelihood).

Remember, we can do this because the p that maximizes the log-likelihood is the same as the p that maximizes the likelihood.

Our log-likelihood is:

log L(p) = h ⋅ log(p) + (n-h) ⋅ log(1-p)

(Log-likelihood of n coin flips with h heads.)
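Here is a small sketch that checks the claim that the likelihood and the log-likelihood peak at the same p. It assumes the forms p^h ⋅ (1-p)^(n-h) and h ⋅ log(p) + (n-h) ⋅ log(1-p), uses the 4 heads in 10 flips from our sequence, and searches a simple grid of candidate values. The grid resolution is an arbitrary choice.

```python
import math

n, h = 10, 4  # the flips from the text: 4 heads out of 10

def likelihood(p):
    return p ** h * (1 - p) ** (n - h)

def log_likelihood(p):
    return h * math.log(p) + (n - h) * math.log(1 - p)

# Candidate values strictly between 0 and 1 (the log is undefined at 0 and 1).
grid = [i / 1000 for i in range(1, 1000)]

best_by_likelihood = max(grid, key=likelihood)
best_by_log_likelihood = max(grid, key=log_likelihood)

print("argmax of likelihood:    ", best_by_likelihood)
print("argmax of log-likelihood:", best_by_log_likelihood)
```

Because the log is a strictly increasing function, it changes the height of the curve but not the location of its peak.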

To find the maximum we’re going to take the derivative of this function with respect to p.

If you’re not comfortable with calculus, the important thing is that you know the derivative is the rate of change of the function.

In this case, the derivative is:

d/dp log L(p) = h/p - (n-h)/(1-p)

(Derivative of the log-likelihood with respect to p.)

We set the derivative equal to 0 to find the maximum of the function (where the rate of change is 0).
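If you want to sanity-check the derivative without doing the calculus, the sketch below compares the formula h/p - (n-h)/(1-p) against a finite-difference approximation of the slope of h ⋅ log(p) + (n-h) ⋅ log(1-p). It uses the 4 heads in 10 flips from our sequence; the step size `eps` and the trial values of p are arbitrary choices.

```python
import math

n, h = 10, 4  # the flips from the text: 4 heads out of 10

def log_likelihood(p):
    return h * math.log(p) + (n - h) * math.log(1 - p)

def derivative(p):
    """Analytic derivative of the log-likelihood: h/p - (n-h)/(1-p)."""
    return h / p - (n - h) / (1 - p)

def numeric_derivative(p, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (log_likelihood(p + eps) - log_likelihood(p - eps)) / (2 * eps)

# The slope should be positive before the peak, about zero at it,
# and negative after it.
for p in (0.2, 0.4, 0.6):
    print(f"p = {p}: analytic = {derivative(p):+.4f}, "
          f"numeric = {numeric_derivative(p):+.4f}")
```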

Setting the above equation equal to 0 and solving for p (try doing this yourself) gives us:

p = h / n

It turns out that the Maximum Likelihood Estimate for our coin is simply the number of heads divided by the number of flips!

This makes perfect intuitive sense: if you flipped a fair coin (p = 0.5) 100 times, you’d expect to get about 50 heads and 50 tails.
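If you tried the algebra yourself and want to compare, here is one way the steps can go, assuming the derivative of the log-likelihood is h/p - (n-h)/(1-p) and that 0 < p < 1. The hat notation for the estimate is my own addition.

```latex
\begin{align*}
\frac{h}{p} - \frac{n-h}{1-p} &= 0 \\
h\,(1-p) &= (n-h)\,p \\
h - hp &= np - hp \\
h &= np \\
\hat{p} &= \frac{h}{n}
\end{align*}
```

For our sequence of 10 flips with 4 heads, this gives an estimate of 4/10 = 0.4.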

Conclusion

Maximum Likelihood Estimation is a powerful technique for fitting our models to data.

The solutions provided by MLE are often very intuitive, but they’re completely data-driven.

This means that the more data we have, the more accurate our solutions become, and the less data we have, the less reliable they are.
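To get a feel for this, here is a rough simulation sketch. It assumes a fair coin (true p = 0.5), simulates flips with Python’s `random` module, and estimates p as heads divided by flips. The seed and the sample sizes are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
true_p = 0.5    # assumed fair coin, for illustration

def mle_estimate(num_flips):
    """Simulate num_flips flips and return heads / flips."""
    heads = sum(1 for _ in range(num_flips) if random.random() < true_p)
    return heads / num_flips

for num_flips in (10, 100, 1_000, 10_000, 100_000):
    estimate = mle_estimate(num_flips)
    error = abs(estimate - true_p)
    print(f"{num_flips:>7} flips: estimate = {estimate:.4f}, error = {error:.4f}")
```

The error won’t necessarily shrink at every step of a single run, but it tends to get smaller as the number of flips grows.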

In a future post, we’ll look at methods for including our prior beliefs about a model, which will help us in low data situations.

See you next time!