# Machine Learning From Scratch: Logistic Regression

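The discussion below applies a cutoff probability to the sigmoid's output. As a reminder of the pieces assumed from earlier in the article, here is a minimal sketch; the function names `sigmoid` and `classify` are illustrative, not the article's original code:

```python
import math

def sigmoid(z):
    """Map any real-valued input to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability, cutoff=0.5):
    """Decision rule: Y = 1 if P > cutoff, else Y = 0."""
    return 1 if probability > cutoff else 0
```

For example, `classify(sigmoid(2.0))` returns 1, while `classify(sigmoid(-2.0))` returns 0.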
For instance, we could, depending on our project's requirements, set Y = 0 if P ≤ 0.5 and Y = 1 if P > 0.5. All that's left to do now is to replace the x in the sigmoid formula above with our regression equation and wrap the result in a function.

## Gradient Descent

We are going to use stochastic gradient descent to find our optimal parameters. In stochastic gradient descent, as opposed to batch gradient descent, we use only a single observation at a time to update our parameters. Apart from that, the process is basically the same:

1. Initialize the coefficients with zero or small random values.
2. Evaluate the cost of these parameters by plugging them into a cost function.
3. Calculate the derivative of the cost function.
4. Update the parameters, scaled by a learning rate/step size.

To get a better understanding of this rough outline, let's look at the Python code. The first step of gradient descent consists of initializing the parameters with zero or small random values. In our case, we have to initialize beta0, beta1, and beta2.

Now that we have initialized our betas, we can use the sigmoid function we defined earlier and see what it does. The sigmoid function returns a probability. Because our betas are all zeros, feeding in the first training observation yields sigmoid(0) = 0.5: the predicted probabilities of that observation belonging to class 1 or class 0 are equal. (We also haven't defined a cutoff probability yet.) To get better predictions, we're going to use stochastic gradient descent to update our parameters.

## Making Predictions

Functions make our life easier; however, we would still have to repeat this process manually for each of our observations. That doesn't sound very fun, does it? Since we've defined a few handy functions, we can put them all together and loop through our training observations. Note that while this works fine on our small dataset, it would very likely become a bottleneck on larger datasets.

Let's walk through this: the parameters we have to define for our function are the number of epochs, the learning rate, and the cutoff probability.
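Putting the pieces together, the training loop described above can be sketched as follows. This is a minimal illustration rather than the article's original code: the function names, the update rule (which assumes the standard log-loss gradient for logistic regression), and the tiny two-feature dataset are all assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(row, betas):
    """Probability that `row` belongs to class 1 under the current betas.
    This replaces the x in the sigmoid with the regression equation
    beta0 + beta1*x1 + beta2*x2."""
    z = betas[0] + betas[1] * row[0] + betas[2] * row[1]
    return sigmoid(z)

def train_sgd(data, labels, epochs, learning_rate):
    """Stochastic gradient descent: update the betas after every single
    observation, rather than after a full pass as in batch gradient descent."""
    betas = [0.0, 0.0, 0.0]               # step 1: initialize with zeros
    for _ in range(epochs):
        for row, y in zip(data, labels):
            p = predict(row, betas)       # step 2: evaluate current prediction
            error = y - p                 # step 3: log-loss gradient term
            # step 4: update each parameter, scaled by the learning rate
            betas[0] += learning_rate * error
            betas[1] += learning_rate * error * row[0]
            betas[2] += learning_rate * error * row[1]
    return betas

# Tiny, made-up dataset for illustration: two features per observation,
# chosen so the two classes are linearly separable.
data = [(2.78, 2.55), (1.46, 2.36), (3.40, 4.40), (7.63, 2.76), (5.33, 2.09)]
labels = [0, 0, 0, 1, 1]

betas = train_sgd(data, labels, epochs=100, learning_rate=0.3)
cutoff = 0.5
preds = [1 if predict(row, betas) > cutoff else 0 for row in data]
```

Note the three knobs the prose mentions: `epochs` (how many passes over the data), `learning_rate` (how far each update moves the betas), and `cutoff` (the probability threshold for assigning class 1).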