Since we can estimate the log odds via logistic regression, we can estimate probability as well because log odds are just probability stated another way.
Notice that the middle section of the plot is linear.
We can write our logistic regression equation:
Z = B0 + B1*distance_from_basket
where Z = log(odds_of_making_shot)
And to get probability from Z, which is in log odds, we apply the sigmoid function.
Applying the sigmoid function is a fancy way of describing the following transformation:
Probability of making shot = 1 / [1 + e^(-Z)]
Now that we understand how we can go from a linear estimate of log odds to a probability, let’s examine how the coefficients B0 and B1 are actually estimated in the logistic regression equation that we use to calculate Z.
There is some math that goes on behind the scenes here, but I will do my best to explain it in plain English so that both you and I can gain an intuitive understanding of this model.
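As a quick check on the conversion described above, here is a minimal Python sketch that turns log odds into a probability and back again. The function names `sigmoid` and `log_odds` are my own labels for illustration, not part of any particular library.

```python
import math

def sigmoid(z):
    """Convert log odds (Z) into a probability: 1 / (1 + e^(-Z))."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Convert a probability into log odds: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Log odds of 0 means even odds, i.e. a 50% chance of making the shot.
print(sigmoid(0.0))            # 0.5

# Going from probability to log odds and back recovers the probability.
p = 0.8
print(sigmoid(log_odds(p)))    # approximately 0.8
```

The round trip is the point here: log odds and probability carry the same information, so a linear model for one gives us the other for free.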
The Cost Function
Like most statistical models, logistic regression seeks to minimize a cost function.
So let’s first start by thinking about what a cost function is.
A cost function tries to measure how wrong you are.
So if my prediction was right then there should be no cost, if I am just a tiny bit wrong there should be a small cost, and if I am massively wrong there should be a high cost.
This is easy to visualize in the linear regression world where we have a continuous target variable (and we can simply square the difference between the actual outcome and our prediction to compute the contribution to cost of each prediction).
But here we are dealing with a target variable that contains only 0s and 1s.
Don’t despair, we can do something very similar.
In my basketball example, I made my first shot from right underneath the basket — that is [Shot Outcome = 1 | Distance from Basket =0].
Yay, I don’t completely suck at basketball.
How can we translate this into a cost?
First my model needs to spit out a probability.
Let’s say it estimates 0.95, which means it expects me to hit 95% of my shots from 0 feet.
In the actual data, I took only one shot from 0 feet and made it so my actual (sampled) accuracy from 0 feet is 100%.
Take that, stupid model!
So the model was wrong because the answer according to our data was 100% but it predicted 95%.
But it was only slightly wrong so we want to penalize it only a little bit.
The penalty in this case is 0.0513 (see calculation below).
Notice how close it is to just taking the difference of the actual probability and the prediction.
Also, I want to emphasize that this error is different from classification error.
Assuming the default cutoff of 50%, the model would have correctly predicted a 1 (since its prediction of 95% > 50%).
But the model was not 100% sure that I would make it and so we penalize it just a little for its uncertainty.
-log(0.95) = 0.0513
Now let’s pretend that we built a crappy model and it spits out a probability of 0.05.
In this case we are massively wrong and our cost would be:
-log(0.05) = 2.996
This cost is a lot higher.
The model was pretty sure that I would miss and it was wrong so we want to strongly penalize it; we are able to do so thanks to taking the natural log.
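The two penalties above are easy to reproduce. This short Python sketch uses the natural log from the standard `math` module; the variable names are mine.

```python
import math

# I made the shot (Actual Outcome = 1), so cost = -log(predicted probability).
cost_good_model = -math.log(0.95)   # model predicted a 95% chance
cost_bad_model = -math.log(0.05)    # model predicted a 5% chance

print(round(cost_good_model, 4))    # 0.0513
print(round(cost_bad_model, 3))     # 2.996
```

The bad model is penalized almost 60 times more than the good one, even though both were scored against the same made shot.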
The plots below show how the cost relates to our prediction (the first plot depicts how cost changes relative to our prediction when the Actual Outcome =1 and the second plot shows the same but when the Actual Outcome = 0).
So for a given observation, we can compute the cost as:
If Actual Outcome = 1, then Cost = -log(pred_prob)
Else if Actual Outcome = 0, then Cost = -log(1 - pred_prob)
Where pred_prob is the predicted probability that pops out of our model.
And for our entire data set we can compute the total cost by:
1. Computing the individual cost of each observation using the procedure above.
2. Summing up all the individual costs to get the total cost.
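The procedure above can be sketched in a few lines of Python. The function names and the three example predictions are made up for illustration; they are not my actual shot data.

```python
import math

def observation_cost(actual, pred_prob):
    """Cost of one prediction; pred_prob must be strictly between 0 and 1."""
    if actual == 1:
        return -math.log(pred_prob)
    return -math.log(1.0 - pred_prob)

def total_cost(actuals, pred_probs):
    """Sum the individual costs over the whole data set."""
    return sum(observation_cost(a, p) for a, p in zip(actuals, pred_probs))

# Hypothetical outcomes (1 = made, 0 = missed) and model predictions.
actuals = [1, 0, 1]
pred_probs = [0.95, 0.10, 0.60]

print(total_cost(actuals, pred_probs))
```

Note that a prediction of exactly 0 or 1 would break the log, which is one way of saying the model should never be completely certain.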
This total cost is the number we want to minimize, and we can do so with a gradient descent optimization.
In other words we can run an optimization to find the values of B0 and B1 that minimize total cost.
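To make the optimization less mysterious, here is a bare-bones gradient descent sketch in plain Python. Everything in it is an assumption for illustration: the shot data is invented, and the learning rate and number of steps are values I picked so that this small example behaves. A real library would use a more refined optimizer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def total_cost(b0, b1, distances, outcomes):
    """Sum of -log(p) for makes and -log(1 - p) for misses."""
    cost = 0.0
    for x, y in zip(distances, outcomes):
        p = sigmoid(b0 + b1 * x)
        cost += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return cost

def fit(distances, outcomes, learning_rate=0.005, steps=5000):
    """Repeatedly nudge B0 and B1 in the direction that lowers total cost."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        grad_b0 = 0.0
        grad_b1 = 0.0
        for x, y in zip(distances, outcomes):
            error = sigmoid(b0 + b1 * x) - y   # prediction minus actual
            grad_b0 += error                   # slope of cost with respect to B0
            grad_b1 += error * x               # slope of cost with respect to B1
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

# Invented shot data: distance in feet, 1 = made, 0 = missed.
distances = [0, 1, 2, 3, 5, 6, 8, 10, 12, 15]
outcomes = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]

start_cost = total_cost(0.0, 0.0, distances, outcomes)
b0, b1 = fit(distances, outcomes)
end_cost = total_cost(b0, b1, distances, outcomes)

print("cost before:", start_cost, "cost after:", end_cost)
print("B0:", b0, "B1:", b1)
```

The gradient here is just the sum of (prediction minus actual) for B0, and the same thing weighted by distance for B1, which is what falls out when you differentiate the cost above.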
And once we have that figured out, we have our model.
Exciting!
Tying it All Together
To sum up, first we use optimization to search for the values of B0 and B1 that minimize our cost function.
This gives us our model:
Z = B0 + B1*X
Where B0 = 2.5 and B1 = -0.2 (identified via optimization)
We can take a look at our slope coefficient, B1, which measures the impact that distance has on my shooting accuracy.
We estimated B1 to be -0.2.
This means that for every 1 foot increase in distance, the log odds of me making the shot decrease by 0.2.
B0, the y-intercept, has a value of 2.5.
This is the model’s log odds prediction when I shoot from 0 feet (right next to the basket).
Running that through the sigmoid function gives us a predicted probability of 92.4%.
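Here is a small Python check of those numbers, using the coefficients from the model (B0 = 2.5, B1 = -0.2). The last two lines go one step beyond the text: a change of B1 in log odds is the same as multiplying the odds by e^B1, so I print that multiplier too.

```python
import math

B0 = 2.5    # intercept: log odds of making the shot from 0 feet
B1 = -0.2   # slope: change in log odds per extra foot of distance

# Sigmoid of the intercept gives the predicted probability at 0 feet.
prob_at_zero_feet = 1.0 / (1.0 + math.exp(-B0))
print(round(prob_at_zero_feet, 3))   # 0.924

# Each extra foot multiplies the odds of making the shot by e^B1.
odds_multiplier = math.exp(B1)
print(round(odds_multiplier, 3))     # 0.819
```

So each extra foot cuts my odds of making the shot by roughly 18%, no matter where I am standing. That statement is about odds, not probability.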
In the following plot, the green dots depict Z, our predicted log odds.
Almost there! We are almost done.
Since Z is in log odds, we need to use the sigmoid function to convert it into probabilities:
Probability of Making Shot = 1 / [1 + e^(-Z)]
Probability of Making Shot, the ultimate output that we are after, is depicted by the orange dots in the following plot.
Notice the curvature.
This means that the relationship between my feature (distance) and my target is not linear.
In probability space (unlike with log odds or with linear regression) we cannot say that there is a constant relationship between the distance I shoot from and my probability of making the shot.
Rather, the impact of distance on probability (the slope of the line that connects the orange dots) is itself a function of how far I am currently standing from the basket.
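To see this numerically, the sketch below uses the same coefficients (B0 = 2.5, B1 = -0.2) and measures how much the predicted probability changes when I step back one foot from three different spots. The function name `prob_of_making_shot` and the chosen distances are my own picks for illustration.

```python
import math

B0, B1 = 2.5, -0.2

def prob_of_making_shot(distance):
    """Predicted probability at a given distance (in feet) from the basket."""
    z = B0 + B1 * distance          # linear in log odds
    return 1.0 / (1.0 + math.exp(-z))

# Same one-foot step back, taken from three different starting distances.
for start in (0, 12, 24):
    change = prob_of_making_shot(start + 1) - prob_of_making_shot(start)
    print(f"{start} -> {start + 1} ft: probability changes by {change:+.4f}")
```

Right under the basket, one extra foot costs me about 1.5 percentage points, while around 12 feet the same step costs about 5. The log odds drop by 0.2 in every case; it is only the probability that responds unevenly.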
Nice! We have our probabilities now.
Hope this helps you understand logistic regression better (writing it definitely helped me).