After I rescaled my features, these warnings went away and my algorithm was able to converge.
By reducing my features to the same scale, I can use the regression coefficients for these features (θ’s) to weigh their relative importance in my weight gain/loss.
This will be elaborated later in the interpretation section of my model.
After the above two steps, my feature matrix (X) is converted to a 2–D numpy array of dimension (46, 3), and my label vector (y) to a 1-D numpy array of dimension (46).
Left: feature matrix X (46, 3).
Columns (L to R): x_intercept, x_surplus, x_step.
Right: label array y (46)Implement batch gradient ascentStep 0: Initialize θ’stheta = np.
5])I initialize all my θ’s to 0.
The result is a 1-D numpy array theta of dimension (3)Step 1: For each data point, calculate the probability of weight gain using θ’s from step 0 and the sigmoid functionprob = 1 / (1 + np.
exp(-X @ theta))This is where numpy’s vectorized operations come in handy, as instead of calculating the probability of weight gain for each data point, numpy does it for all data points at once:X @ theta: by multiplying matrix X with vector theta , numpy essentially calculates the linear combinations of x’s and θ’s for each and every data point by taking the dot product of each row of X with the theta column (see bolded cells in the diagram).
1 / (1 + np.
exp(-X @ theta)): after linear combinations of x’s and θ’s for all data point are calculated, the sigmoid function is applied to each of them to get the final probabilities of weight gain for all data points.
Notice that the operations in this sigmoid function (/, +, -, np.
exp) all represent vectorized functions that numpy runs behind the scenes, which finally output the probability vector prob, a 1-D numpy array of dimension (46).
Step 2: For each feature, calculate the partial derivative of log-likelihood to its corresponding θ using the calculated probabilities from step 1gradient = (y – prob) @ XThis is also where numpy’s vectorized operations shine:y – prob: a straightforward element-by-element subtraction of the real label and the predicted probability for all data points, resulting in a 1-D numpy array of dimension (46) representing the differences.
(y – prob) @ X: before the y – prob differences vector (46) is multiplied with the feature matrix X, it is transposed by numpy behind the scenes into a row vector of dimension (1, 46), which allows its dimension to line up with that of X(46, 3).
This transpose is also called in numpy as “prepending 1 to dimension” of the column vector.
Once the dimensions are lined up, the vector-matrix multiplication between y – prob and X can occur: dot products are taken between the row vector of y – prob and each feature column of X (see bolded cells in the diagram).
This is indeed the summation of (y(i) — P(y(i)=1)) * x_j(i) over all data point i’s to get the partial derivative of the log-likelihood for feature j (Equation 2).
More impressively, this partial derivative can be calculated for all three features at once under this vector-matrix multiplication, resulting in a vector of partial derivatives called the gradient.
Technically, this vector should be of dimension (1, 3) from multiplying a (1, 46) vector with a (46, 3) matrix.
However, behind the scenes, the “prepended 1 (before multiplication) is removed” by numpy after multiplication, and the final gradient vector is a 1-D array of dimension (3).
These behind-the-scenes “contortions” that numpy applies to its arrays before and after multiplication can be referred from the documentation of numpy.
matmul, which implements the matrix multiplication operator @.
If the first argument is 1-D, it is promoted to a matrix by prepending a 1 to its dimensions.
After matrix multiplication the prepended 1 is removed.
matmul documentationStep 3: For each feature, update its θ by the partial derivative in step 2 multiplied by the learning rate αalpha = 0.
01theta = theta + alpha * gradientThis can’t be more straightforward: instead of updating each θ, we can update all 3 θ’s at once by multiplying the gradient vector of partial derivatives with some pre-defined learning rate alpha, and adding the product to the theta vector.
Step 4: Repeat step 1 to 3 until convergenceThis can be easily done by nesting the previous 3 lines of code— one for each step (see bolded code block below)— into a for loop that iterates many times.
Presented below is the entire algorithm for batch gradient ascent running for 100 iterations (isn’t it amazing how simple the implementation can be?!):theta = np.
5])alpha = 0.
01for _ in range(100): prob = 1 / (1 + np.
exp(-X @ theta)) gradient = (y – prob) @ X theta = theta + alpha * gradientChecking for convergenceOne way to check for convergence of the algorithm is to see if the difference in log-likelihood stays below a some small tolerance level for the past few iterations of the loop, indicating that the log likelihood has likely reached its maximum.
Using the dot products of y and prob(and of their respective complements), the log-likelihood can be computed simply as:log_likelihood = y @ np.
log(prob) + (1 – y) @ np.
log(1 – prob)Another (perhaps more fun) way is to run the algorithm for some number of iterations and visualize whether the log-likelihood has reached its likely maximum.
Below is a visualization of 60 iterations of the batch gradient ascent algorithm with learning rate α = 0.
01:As seen from the above animation, the average log-likelihood (log-likelihood divided by number of training data, see middle panel) rises quickly for the first 10 iterations but starts to plateau after that.
At the 60th iteration, the difference in average log-likelihood between iterations is in the order of 10^-5, which indicates a good enough convergence.
This also corresponds to the convergence of the regression coefficients (θ’s) in the left panel.
Another way to visualize this convergence is through the classification boundary (see right panel).
The classification boundary, also called the decision boundary, represents the line where the predicted probability of weight gain is 50%: every point above it will have predicted probability below 50% — hence classified as weight loss — and every point below it will have probability above 50% — hence classified as weight gain.
Over these 60 iterations, the decision boundary seems to settle down to a reasonable boundary that separates most of the weight loss days (green) from the weight gain days (red).
Now that we have a converged logistic regression model that classifies my training data pretty well (at least from eyeballing the decision boundary), let’s see how we can improve on it.
Model improvementChoose the right learning rateGiven a fixed number of iterations, the value of learning rate (α) can determine whether the algorithm will converge after these iterations, or even if it will converge at all:When α is reduced to 0.
001 (from the original 0.
01), learning of θ’s happens much slower, and after 60 iterations, the average log-likelihood still shows signs of increase.
Therefore, the number of iterations should be increased at this small learning rate, or the learning rate itself should be bumped up.
However, if the learning rate is too high, the θ’s could be “over-corrected” and bounce around the optimum point after each update.
This can be seen from the convergence animation below when α = 1, which doesn’t seem to converge after 60 iterations.
Therefore, the learning rate should be reduced in those cases.
With α = 0.
01 as the sweet spot for our learning rate, we can of course always increase the number of iterations to make sure our model converges well and good to the maximum log-likelihood.
Indeed, at 1000 iterations, the difference in average log-likelihood between iterations is effectively zero.
Reduce overfit using ridge regressionEven though my logistic regression model has converged to the max log-likelihood of my training data, it might have overfit that training data i.
it “explained” the data a little too well.
As a result, the model might predict well on past data that I’ve collected in 2018, but might be terrible if I were to use it to predict my weight gain in 2019.
One solution to reduce the overfitting of logistic regression is to used the L2-regularized version of the regression (also called ridge regression), which subtracts the original log-likelihood by a term consisting of squares of the θ’s:m: number of training data points, n: number of featuresAs a result, maximizing the above function is equivalent to maximizing the log-likelihood of training data as much as possible while keeping θ’s low (as higher θ’s will lower L).
The lambda symbol (λ) represents the degree that θ’s are kept low (often called the regularization hyperparameter of the model).
When λ = 0, ridge regression returns exactly to the original, non-regularized logistic regression.
In terms of implementation of ridge regression, the only difference from the original logistic regression is in the calculation of partial derivatives (equation 2) of batch gradient ascent, where a λ*θ_j regularization term is subtracted from the partial derivative of each feature j:One caveat is that this regularization is not often done to the intercept θ, so the partial derivatives of θ_intercept is calculated the same as the non-regularized version i.
without subtracting this λ*θ term.
This implementation is easily integrated into our existing Python code by:Multiplying theta with the regularization hyperparameter lambda_reg to get reg_term — the λ*θ regularization term of dimension (3).
Setting the first element of reg_term to zero to represent the fact that θ_intercept is not regularizedSubtracting reg_term from (y – prob) @ X to find the gradient.
Below is the code for ridge regression with λ = 10, with modifications to the original algorithm shown in bold:theta = np.
5])alpha = 0.
01lambda_reg = 10for _ in range(100): prob = 1 / (1 + np.
exp(-X @ theta)) reg_term = lambda_reg * theta reg_term = 0 gradient = (y – prob) @ X – reg_term theta = theta + alpha * gradientWhen convergence is monitored for this ridge regression at 60 iterations, we can see that:At λ = 10, the θ for my two main features (calories surplus and step count, see left panel) converge to values much closer to zero i.
lower in magnitude compared to the non-regularized version (λ = 0).
The convergence of the intercept θ, however, is largely unaffected.
The average log-likelihood of ridge regression converges to a lower value than that of the non-regularized version (see middle panel), indicating that ridge regression provides a less perfect fit to my training data, but this might also mean it overfits my training data less.
The decision boundary of the ridge regression is a little off from the non-regularized boundary (see right panel).
However, it still seems to separate my training data points (red from green) pretty well.
As λ increases, the learned regression coefficients (θ) are further squeezed towards zero, except that of the intercept (see left panel below).
Furthermore, ridge regression becomes less and less effective in classifying my training data, as seen from the decision boundaries: the decision boundary at λ = 1 is quite close from the non-regularized boundary, while at λ = 100, the boundary is virtually unusable (see right panel).
However, the purpose of ridge regression is not to improve the fit on training data (because if so, it will always perform worse than the non-regularized version, as seen above).
Rather, it is to improve the prediction on new data i.
data that the ridge regression had not been trained from.
To compare ridge regression with its non-regularized counterpart, I use 2-fold cross-validation as outlined below:Split the 46 data points randomly into 2 equal folds/parts: A & B (each 23 data points)Train ridge regression on part A — the training set — and record the recall on the training set (days with correctly predicted weight gain / days with true weight gain in part A)Use the θ’s trained on part A to predict whether I would gain weight on the days of part B — the validation set — and record the recall on the validation set (days with correctly predicted weight gain / days with true weight gain in part B)Repeat step 2 and 3 with the parts switched i.
part B is now the training set, and part A the validation setAverage the training set recalls across both trials, and similarly for the validation set recallsWhy use recall?There are two fundamental metrics to measure how well a classifier works: precision and recall.
In this context:Personally, I would not care if the classifier predicts that I would gain weight while it turns out I don’t; in fact, that would even be a welcome surprise!.Therefore, I’m not too concerned about false positives among the predicted weight gain days.
In other words, precision is not that important to me.
On the other hand, I’m more likely to obsess over whether the classifier would capture all the days where I truly end up gaining weight, lest they escape as false negatives (classifier predicts I would lose weight when it’s the exact opposite).
In other words, I will try to improve recall as much as possible.
Below are the average train and validation recalls at different levels of regularization, with λ ranging from 0 (non-regularized) to 10:Average train & validation set recall (across 2 folds) at different λ valuesFrom the above graph, the average training set recalls stay at a constant level from the start (λ = 0), and only starts to drop at λ near 10.
This is consistent with the earlier observation that the performance of ridge regression on training data gets worse as λ increases.
On the other hand, the average validation set recall has a noticeable bump for λ between 0.
1 and 1, indicating that the ridge regression at these λ’s performs better than its non-regularized counterpart on new validation data, even though both perform equally well on training data.
Upon further inspection, it turns out a true weight gain day (red with black border, see below graph) in one of the validation sets was incorrectly classified as weight loss by the non-regularized regression: it stays above the (solid) decision boundary.
On the other hand, when λ = 0.
5 — halfway between 0.
1 and 1 — the decision boundary twists slightly.
As a result, this point stays below this (dashed) decision boundary, and was correctly classified.
This was the sole reason that ridge regression at λ = 0.
5 has better recall on the validation set compared to its non-regularized counterpart.
Although these results suggest I should choose the λ value with the highest average validation set recall (λ = 0.
5), the small number of data points in the validation set (23), within which there are even smaller number of weight gain points (red), suggests that this improved performance might just be due to luck.
This also explains why for some λ’s, the performance on the validation set is higher than that of the training set at the same λ, even though the opposite usually happens; I could just have a “lucky” validation set.
That said, there is no harm in choosing λ = 0.
5, given that when trained on my entire data of 46 points, its decision boundary is virtually indistinguishable from that of λ = 0, as seen from the earlier graph on decision boundaries at different λ’s (from 0 to 100).
Therefore, I choose to stay with λ = 0.
5 for my final model.
Choose threshold for decision boundaryChoosing λ is not the most the most influential decision in tweaking my model.
Rather, choosing the threshold for my decision boundary is:For all the regression models that I have built thus far, the classification threshold, hence the decision boundary, is set at 50% (or 0.
This a reasonable threshold, as days with predicted probability of weight gain above 50% should naturally be classified as days with weight gain, and vice versa.
However, when this weight gain threshold is lowered, more and more data points will be classified as weight gain (correctly or not).
Recall, which also goes by true positive rate, will of course increase, but false positive rate — days mistakenly predicted as weight gain out of days with true weight loss — will increase in lockstep (see the left panel, and also the ROC curve in the middle panel below).
Nevertheless, as mentioned earlier, since I’m not too worried about false positives, I can tolerate many weight loss days mistakenly classified as weight gain i.
high false positive rate, if that means my true weight gain days are better detected i.
high true positive rate/recall.
If that’s the case, should the threshold be set as low as possible, even to 0%?.No, because when this threshold is lowered, the decision boundary shifts up.
For example, for recall to increase to the nearest upper level, the threshold must decrease from 50% to 44% (from black to brown in the left panel).
As a result, the decision boundary shifts up to capture one more weight gain point (red point with brown border in the right panel).
This corresponds to a leftward shift of 107 kcal i.
if I were at the 50% boundary before, I have to eat 107 kcal less at the same step count in order to stay on the good weight loss side of the 44% boundary.
At the extreme end where threshold is 0% (orange point in the left panel), the boundary shifts way up so that it captures all weight gain points, including the topmost red point with orange border (right panel).
This decision boundary dictates that on my lazy day, where I average nearly 2500 steps, I should eat 1740 kcal below my budget to stay on the good side.
Given that the average budget in my data is around 1715 kcal, this translates to an upper limit of -25 calories on those lazy days (yes you read that right).
Sure, my recall would be amazing at this physics-defying limit, but I would be dead!So should I reduce my threshold after all?.Given that the 50% decision boundary learned from my data is already quite conservative — on lazy days, it suggests that I should eat around 140 kcal below the usual 1700+ kcal budget — I’ve decided to stick with the default 50% threshold.
Dropping that threshold to the next possible level of 44% would gain me few more points on recall, but the extra 107 kcal restriction is not worth losing my sanity over.
Model interpretationTo recap, we have trained a logistic regression classifier via the gradient ascent algorithm, using my daily calories surplus and step counts as features, and my weight gain status as labels.
The parameters for our model are:Learning rate: α = 0.
01Regularization parameter: λ = 0.
5Threshold: 50%After training the classifier on the entire data set, the learned regression coefficients (θ’s) come out to:Odds ratio analysisOne common interpretation of the logistic regression coefficients is through odds ratio: how many times the odds of the outcome changes for one unit change of the feature.
For a feature j, the odds ratio of that feature is just the exponential of the its θ.
Recall that logistic regression is performed on the normalized values of calories surplus (found by subtracting the surplus mean and dividing by the surplus standard deviation).
Therefore, θ_surplus = 0.
9 implies that at any step count, a standard deviation decrease in calories surplus — about 420 kcal — corresponds to e^0.
9, or 2.
5 times decrease in the odds that I will gain weight.
On the other hand, with θ_step = -1.
2, a standard deviation increase in my step count — about 5980 steps — corresponds to e^1.
2, or 3.
3 times decrease in the odds that I will gain weight (at any amount of calories surplus).
From my last 10 workouts, I have an average cadence of 1366 steps/km.
As a result, these 5890 steps translates to about 4.
In other words, a standard deviation increase in step counts (5980 steps) is 36% more effective in reducing my weight gain odds than a standard deviation decrease in calories consumed (420 kcal).
Of course, my willingness to run 4.
4 km instead of not eating that 420-kcal bowl of pho is another matter entirely!Although odds ratio offers some insights in how I should strategize my weight loss, a more actionable plan could be drawn from the decision boundary of my logistic regression classifier.
Decision boundary analysisRecall our trusty sigmoid function from earlier:Superscript (i) removed for simplicityIt can easily be seen that for the probability of weight loss to be 50% (where the classification threshold is), the linear combination of θ’s and x’s must be zero* (I’ve also replaced x_intercept by 1):50% decision boundary formula for normalized features* For other probability thresholds, the linear combination of θ’s and x’s can be found by taking the logit of the threshold: ln(p/(1-p))However, the x’s used in our logistic regression were the normalized values of our original features.
Therefore, we can rewrite the above equation as:* denotes the original feature value.
μ: feature mean, and σ: feature standard deviationRearranging, we have:50% decision boundary formula for original featuresPlugging θ’s and the feature means (μ) and standard deviations (σ) to the above equation gives the linear equation between the original calories surplus and step count features:Graphically, this equation represents the decision boundary in the original surplus-step plot (see left panel below).
From this decision boundary, there are 3 important numbers that I should keep in mind (see right panel):1) -140 kcalThis is how much I should be eating below the calories budget suggested by the Loseit app on my lazy days (in which I average 2480 steps).
With the average budget of just above 1700 kcal, this means on those days I should be eating below 1560 kcal on average.
This number sounds quite restrictive indeed.
2) 130 kcalHowever, one saving grace is that according to the decision boundary, for any 1 km I run beyond my normal activity, I am allowed to increase this limit by 130 kcal.
For example, if I had scheduled a 5K run on a given day, I can afford to eat -140 + 130 * 5 = 510 kcal above the calories budget from the app for that day.
Hopefully this will encourage me to keep up my running schedule.
3) 1070 stepsOn the flip side, on any given day when I’m tempted to overeat beyond the calories limit (as dictated by the first two rules), any 100 kcal that I eat beyond the limit must be earned with at least 1070 steps.
This can be done either by:Walking that 1070 steps around the block before chowing down on that banh bao, orRealizing I’m too lazy for that and putting said food out of my stupid mouth (hey, I just built my own Slap Chef!)HIt me with that number baby!ConclusionI hope to use the above guidelines to be more successful in my weight loss journey in 2019 than in 2018.
Of course I won’t always succeed even with the numbers behind me, but the most important lesson I’ve learned from this project is: I should be kind to myself.
For example, I do lose weight even if I eat more than the budget on my active days, so I should not feel guilty for doing so.
Hopefully, with this new model that incorporates both diet and exercise, I can feel less guilt and shame (like I did before) during my ongoing weight loss journey.
Other lessonsAnother big reason that I embarked on this project is to implement a machine project without using pre-existing libraries such as scikit-learn.
Below are some lessons I learned from that process:When I first learned Python (or any serious programming for that matter) more than a year ago, I was not sure why one would ever use classes.
Well, after few dozen times during this project of (a) using the global variable theta, (b) getting some weird result like non-convergence, and (c) realizing said theta belongs to some other model that I ran many moons ago, I now realize why encapsulation in classes is so important: a model object could have its own attributes (theta, alpha, lambda_reg) and methods (fit, predict) that do not clash with those from other model objects, and I can merrily pick it up in one piece the next time I need it.
I’m really interested on how classes could be used to model data science problems, and I think with more relevant examples I might be able to appreciate more the powers of object-oriented programming (for now I’m still not quite sure when or how I should use them).
— Me when first learning PythonThis project also allows me to implement and understand the practical benefit of some neat programming concepts that I never had the chance to use, such as using generators i.
yield statements to return my training and validation folds one at a time, evaluate my model on those, then go back and generate more folds instead of storing all folds in memory at once (although for my data size it makes virtually no difference).
Even something as CS101 as writing clear pseudocode turns out to be quite important, especially when implementing mathematically-flavored algorithms like logistic regression; it’s embarrassing to admit the hours I’ve spent debugging from not writing explicit pseudocode beforehand and swapping alpha and theta by mistake, even leading me to research some fancy updating algorithm for gradient ascent to fix it (which thankfully I did not have to implement).
Lastly, this project helps me grok in more details the implementations of some of the machine learning libraries that I often used, such as the occasional warning that I have to specify either max_iter or tol when training a scikit-learn model: the former to specify the number of iterations to let the model converge, the latter to specify the tolerance level below which iterations are stopped — exactly the two choices I faced when checking for convergence for my own model.
Another example is that I now learn some of the scikit-learn models do not behave like I would expect them to: SGDClassifier(loss=’log’, penalty=’l2', learning_rate=’constant’) does not seem to shrink the θ of the intercept, and gave similar θ’s to my model, while LogisticRegression(penalty=’l2') shrinks the θ of intercept at the default setting, unless one fiddles with the intercept_scaling parameter.
As a result, I will be more careful when using third-party libraries to analyze my data in the future, and validate the result when necessary.
Further resourcesI took most of the derivations for logistic regression and its gradient ascent method from the corresponding lecture note and video lectures of CS229 (a machine learning course taught by professor Andrew Ng that I highly recommend).
The note provides the formula for stochastic gradient ascent, but the batch version can be easily modified from that (as seen in the math review section earlier).
A more accessible explanation can be found in week 3 of his machine course on Coursera.
This course also covers the application of ridge regression on logistic regression, which the CS229 course does not.
One small detail: the Coursera course refers to logistic regression as minimizing the log-loss using (among other methods) stochastic gradient descent.
However, this is identical to maximizing the log-likelihood using stochastic gradient ascent that I had implemented, with the log-loss nothing more than the negative of the log-likelihood with some minor modifications.
Another good explanation for logistic regression and ridge regression is from the University of Washington’s Coursera course on classification methods.
This course thankfully uses log-likelihood maximization to explain logistic regression so it should be consistent with the notations from CS229 and mine.
Last but not least are two weight loss analysis projects I found on the web that really inspired my own project: one from Will Koehrsen, who uses only his past weights to forecast future weight loss, and at the other end of the spectrum is one from Ariel Faigon, who uses dozens of factors to predict how much weight he gains or loses each day, and comes up with very interesting results: apparently sleep is the most important factor for his weight loss!.My approach is somewhere in the middle, using only 2 features to predict whether I would lose weight or not, which allows for easy visualization and simple, actionable insights that I can take away for my weight loss journeyI hope that my project can inspire others to use machine learning and data science to help them understand more about themselves and accomplish personal goals such as weight loss.
Please don’t hesitate to contact me on Medium if you have any question or feedback!.