Photo by Janus Clemmensen on Unsplash

Log Transformation Base For Data Linearization Does Not Matter

A simple derivation explaining why log base has no significant effect when linearizing data

Jeremy Chow, Jun 27

Code for this demonstration can be found here:

Today a colleague asked me a simple question:

“How do you find the best logarithm base to linearly transform your data?”

This is actually a trick question, because there is no best log base to linearly transform your data: the fact that you are taking a log will linearize it no matter what the base of the log is.
My colleague was skeptical and I wanted to brush up on my algebra, so let’s dive into the math!

Premise

Let’s assume you have exponential data. This means your data is in some form similar to the following:

$$y = C_1 e^{x} \tag{1}$$

This means our data is non-linear.
Linear data is arguably the easiest form of data to model, because through linear regression we can directly quantify the effect of each feature on the target variable by looking at its coefficient.
Linear regression gives humans an intuitive and quantitative sense of how the model thinks our dependent variable is influenced by our independent variables, in contrast to, for example, the black boxes of deep neural nets.
Derivation

Since we know the base here is e, we can linearize our data by taking the natural log of both sides (ignoring the constant C₁):

$$\ln(y) = x \tag{2}$$

Now if we plot ln(y) vs. x, we get a line.
That’s pretty straightforward, but what happens if we didn’t know that the base of our power was e? We can try taking the log (base 10) of both sides:

$$\log_{10}(y) = \log_{10}(e^{x}) \tag{3}$$

but it doesn’t seem to look linear yet.
However, what if we introduce the logarithm power rule?

$$\log_{10}(y) = x \log_{10}(e) \tag{4}$$

But log₁₀(e) is a constant! Therefore we have:

$$\log_{10}(y) = C x, \quad C = \log_{10}(e) \approx 0.434 \tag{5}$$

This means that our base 10 log is still directly proportional to x, just by a different scaling factor C, which is the log of the original base e in this case.
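The base-10 steps above go through unchanged for any other base. As a short sketch, here b is a symbol I am introducing for an arbitrary log base (b > 0, b ≠ 1), and the constant C₁ is ignored as before:

```latex
\begin{aligned}
\log_{b}(y) &= \log_{b}(e^{x}) \\
            &= x \, \log_{b}(e) \\
            &= \frac{x}{\ln(b)}
\end{aligned}
```

So the choice of base only changes the slope, from 1 for the natural log to 1/ln(b) for base b. For b = 10 that slope is 1/ln(10) = log₁₀(e) ≈ 0.434, matching the constant C above.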
What does it look like?

We can also visualize this with some Python code!

    import numpy as np
    import matplotlib.pyplot as plt

    # Set up variables, x is 1-9 and y is e^x
    x = list(np.arange(1, 10))
    y = [np.exp(i) for i in x]

    # Plot the original variables; this is barebones plotting code, you
    # can find the more detailed plotting code on my github!
    plt.plot(x, y)
    plt.show()

    # Plot log base 10 of y
    plt.plot(x, np.log10(y))
    plt.show()

    # Plot log base e of y
    plt.plot(x, np.log(y))
    plt.show()

They are both linear, even though the logarithms have different bases (base 10 vs base e)!

The only thing that changed between the two logarithms was the y-scale, because the slopes differ: the base-e line has slope 1, while the base-10 line has slope log₁₀(e) ≈ 0.434.

The important part is that both are still linearly proportional with x, and thus would have equal performance in a linear regression model.
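To put a number on "equal performance", here is a minimal sketch. It assumes some made-up noisy exponential data; the noise level, the random seed, and the helper `fit_line` are my own illustration and not part of the original demo. It fits a straight line to ln(y) and to log₁₀(y) with `np.polyfit` and compares the slopes and R² values.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: y = e^x with a little multiplicative noise (assumption)
x = np.arange(1, 10, dtype=float)
y = np.exp(x) * rng.lognormal(mean=0.0, sigma=0.1, size=x.size)


def fit_line(x, t):
    """Least-squares fit of t ~ slope * x + intercept; also returns R^2."""
    slope, intercept = np.polyfit(x, t, deg=1)
    residuals = t - (slope * x + intercept)
    r2 = 1.0 - residuals.var() / t.var()
    return slope, intercept, r2


slope_ln, intercept_ln, r2_ln = fit_line(x, np.log(y))
slope_10, intercept_10, r2_10 = fit_line(x, np.log10(y))

# The slopes differ by the constant factor log10(e); the fit quality does not
print("slope ratio:", slope_10 / slope_ln, "log10(e):", np.log10(np.e))
print("R^2 (ln):", r2_ln, "R^2 (log10):", r2_10)
```

The slope ratio should come out as log₁₀(e) and the two R² values should agree up to floating-point error, since one transformed target is just a constant multiple of the other.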
Conclusion

In summary: If you have exponential data, you can do a log transformation of any base to linearize the data.
If you have an intuition for the base from domain knowledge, then use the correct base — otherwise it doesn’t matter.
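Here is a small sketch of why a known base is still worth using. It assumes a hypothetical doubling process (the starting amount 3.0 and the range of x are made up for illustration). Both logs give a straight line, but the base-2 slope reads directly as "doublings per unit of x".

```python
import numpy as np

# Hypothetical doubling process: the quantity doubles for every unit of x
x = np.arange(0, 10, dtype=float)
y = 3.0 * 2.0 ** x  # 3.0 is an arbitrary starting amount (assumption)

# Fit a line to the log-transformed data in the "natural" base and in base e
slope_log2, intercept_log2 = np.polyfit(x, np.log2(y), deg=1)
slope_ln, intercept_ln = np.polyfit(x, np.log(y), deg=1)

# Base 2: slope is about 1, i.e. one doubling per unit of x
# Base e: slope is about ln(2), the same line rescaled
print("log2 slope:", slope_log2)
print("ln slope:", slope_ln, "ln(2):", np.log(2.0))
```

Either fit is equally linear; the base-2 version is just easier to explain, because its coefficient is already in the units you care about.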
Side Note: Other Transformations

What if your data is in the slightly different form of x raised to the power of some unknown λ?

$$y = x^{\lambda} \tag{6}$$

In this case, a Box-Cox transformation will help you find the ideal power to raise your data to in order to linearize it.
I recommend using SciPy’s implementation (scipy.stats.boxcox).
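As a rough sketch of what calling it looks like, the snippet below runs `scipy.stats.boxcox` on a made-up, right-skewed sample (the lognormal data, its parameters, and the seed are assumptions for illustration). Note that Box-Cox picks λ by maximum likelihood under a normality assumption on the transformed values, and it needs strictly positive input.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Hypothetical strictly positive, right-skewed data (assumption)
data = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

# With lmbda left unset, boxcox returns the transformed data and the fitted lambda
transformed, fitted_lambda = stats.boxcox(data)

print("fitted lambda:", fitted_lambda)
print("skew before:", stats.skew(data), "skew after:", stats.skew(transformed))
```

For lognormal data like this the fitted λ should land close to 0, which in the Box-Cox family corresponds to a plain log transform, so it ties back nicely to the main point of this post.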
That’s all for this one, thanks for reading!