Best exponential transformation to linearize your data with Scipy

How to find the best exponent to fit a linear relation with the scipy optimization package.

Teo Argentieri, Feb 20

An iterative search is necessary for any application in which we would like to find an optimum but the solution to the problem is not expressible in an explicit form.

For instance, there are plenty of algorithms in machine learning that use iterative approaches to find the best set of parameters, like Lasso linear regression, gradient boosting machines, etcetera.

In this article, we will try to use a numerical approach in the ETL process, transforming a non-linear relationship between two variables into a linear one with the optimal exponential transformation.

As a Data Scientist, I often have to check the relationship between different variables and summarize some key indicators from them.

I recently came across a project for the evaluation of motor efficiency, where I wanted to express a sort of fuel consumption/speed ratio over a conveyance's lifetime.

The relation between the case-study variables was non-linear and monotonically increasing, so I started searching on Google for a statistical test that could exploit a transformation of my data to make it more linear, like the Box-Cox transformation for normality.

At this point, I decided to perform an experiment: an iterative process that linearizes my data by minimizing a cost function.

In my ordinary work, I often make use of the scipy.optimize module to find function minima, so why not use it for other purposes? You can read more about scipy.optimize in the official documentation, which provides useful explanations and examples.

Starting settings

In my search, I have focused on the exponential transformation because we can easily treat the exponent as a parameter and provide a continuous range to explore.

Although this choice excludes some strongly non-linear relationships, it returns good results in general.

Let us prepare test data and create two related variables x, y, where y is equal to x raised to an exponent e, plus some Gaussian noise.

For convenience, I have made the scale of the Gaussian noise depend on the exponent too.

```python
#test data setting
import numpy as np

e = 2.465  #exp
x = np.arange(0, 25, 0.01)
y = x**e + np.random.normal(0, 10**e, x.shape)
```

If we plot the data with a seaborn regression plot, we can easily spot a non-linear relation.
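The article does not show the plotting code, so here is a minimal sketch of how such a seaborn regression plot could be produced (the file name and styling keywords are my own choices, not the article's):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import seaborn as sns

e = 2.465
x = np.arange(0, 25, 0.01)
y = x**e + np.random.normal(0, 10**e, x.shape)

# regplot fits and draws a linear regression line over the scatter;
# the systematic curvature of the points around it exposes the non-linearity
ax = sns.regplot(x=x, y=y, scatter_kws={'s': 2}, line_kws={'color': 'red'})
ax.set(xlabel='x', ylabel='y')
plt.savefig('regplot.png')
```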

Cost Function

What we need now is a cost function, a measure of the 'goodness' of the linear relation that we want to maximize.

A good indicator is the Pearson product-moment correlation coefficient r, which identifies the strength of the linear correlation between two variables.

Pearson r takes values between -1 and 1, where 1 is a perfect positive linear correlation, 0 is no linear correlation, and -1 reveals a perfect negative linear correlation; it means that r = -1 is as good as r = 1 for our purposes.
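These three cases are easy to verify with numpy.corrcoef (the toy data below is mine, chosen so each correlation is exact):

```python
import numpy as np

x = np.arange(10, dtype=float)

# perfect positive linear relation: r = 1
r_pos = np.corrcoef(x, 3 * x + 2)[0][1]

# perfect negative linear relation: r = -1
r_neg = np.corrcoef(x, -3 * x + 2)[0][1]

# no linear relation (a parabola symmetric around the sample mean): r = 0
r_none = np.corrcoef(x, (x - x.mean()) ** 2)[0][1]

print(r_pos, r_neg, r_none)
```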

Thus, to use Pearson r properly, we will take its absolute value and negate it, because scipy.optimize functions search for minima, whereas we want the maximum.

Let us define the cost function:

```python
#define cost function
def cost_function(e):
    #y and x are already defined
    r = np.corrcoef(y, x**e)  #returns correlation matrix
    #print each iteration
    print('r value: {:0.4f} exp: {:.4f}'.format(r[0][1], e))
    return -abs(r[0][1])
```

Optimize function

At this point, we have to call one of the Scipy methods.

A suitable choice could be the minimize_scalar method since our cost function is a scalar function.

The default algorithm behind this function is Brent's method, which needs no gradient estimation.
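Before applying minimize_scalar to our problem, it is worth seeing it on a function whose minimum we know in advance (the quadratic below is purely illustrative):

```python
from scipy.optimize import minimize_scalar

# f(t) = (t - 3)^2 + 1 has its minimum at t = 3, where f(3) = 1
res = minimize_scalar(lambda t: (t - 3) ** 2 + 1)

# res.x is the minimizer found, res.fun the function value there
print(res.x, res.fun)
```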

I’ve found a very exhaustive video by the Oscar Veliz channel on Brent’s method and its dependency on Dekker’s and secant methods.

Check it out if you want to know more about this and other optimization methods.

Let us import and call the minimize_scalar function:

```python
from scipy.optimize import minimize_scalar

minimize_scalar(cost_function)
```

We can also set a search range, avoiding the value 0 for the exponent, which makes Pearson r return an invalid value, even if numpy.corrcoef can handle it.

The coefficient is, in fact, defined as the covariance of the two variables divided by the product of their standard deviations: if x is raised to the power 0, its standard deviation is 0, and the ratio returns an invalid value.

To perform a bounded search, let us call:

```python
minimize_scalar(cost_function, bounds=(0.1, 10), method='bounded')
```

The resulting listing is:

```
r value: 0.9242 exp: 3.8815
r value: 0.8681 exp: 6.2185
r value: 0.9416 exp: 2.4371
r value: 0.9100 exp: 1.2663
r value: 0.9407 exp: 2.7565
r value: 0.9416 exp: 2.4255
r value: 0.9416 exp: 2.4861
r value: 0.9416 exp: 2.4815
r value: 0.9416 exp: 2.4819
r value: 0.9416 exp: 2.4819
r value: 0.9416 exp: 2.4819
r value: 0.9416 exp: 2.4819

     fun: -0.9416331392353501
 message: 'Solution found.'
    nfev: 12
  status: 0
 success: True
       x: 2.4818969221255713
```

The exponent found, in just 12 iterations, is 2.482, really close to the exponent we used to generate the data, 2.465.

The fun field shows the negative absolute value of Pearson r, which turns out to be quite high.

Let us plot y and x again, applying the exponent found to x: we will notice a strong linear relationship.

If we store each iteration's exponent and the related Pearson coefficient, we can plot the r-exponent curve.
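One way to store the iterations (a sketch of my own, not the article's code, with the list name `history` chosen for illustration) is to append to a list inside the cost function:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
e_true = 2.465
x = np.arange(0, 25, 0.01)
y = x ** e_true + rng.normal(0, 10 ** e_true, x.shape)

history = []  # one (exponent, |r|) pair per iteration

def cost_function(e):
    r = np.corrcoef(y, x ** e)[0][1]
    history.append((e, abs(r)))
    return -abs(r)

res = minimize_scalar(cost_function, bounds=(0.1, 10), method='bounded')

# sort by exponent before plotting, since the optimizer
# does not visit the exponents in order
exps, rs = zip(*sorted(history))
# e.g. plt.plot(exps, rs) then draws the r-exponent curve
```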

Other tests

What if we increase the impact of noise in the test data? Let us increase the Gaussian variance in the noise generator:

```python
y = (x**e) + np.random.normal(0, 20**e, x.shape)
```

The execution of the optimization function returns the following result:

```
     fun: -0.42597730774659237
 message: 'Solution found.'
    nfev: 13
  status: 0
 success: True
       x: 2.2958258442618553
```

The optimal exponent found is not as precise as the previous result, but it is still a good approximation.

Increasing the noise further will lead to misleading results, as the noise overwhelms the underlying signal.
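This degradation can be sketched by sweeping the noise base and rerunning the optimization each time (the sweep below is my own experiment, not the article's; the seed and bases are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)
e_true = 2.465
x = np.arange(0, 25, 0.01)

results = {}
for base in (5, 10, 20, 30):
    y = x ** e_true + rng.normal(0, base ** e_true, x.shape)
    res = minimize_scalar(
        lambda e: -abs(np.corrcoef(y, x ** e)[0][1]),
        bounds=(0.1, 10), method='bounded')
    results[base] = (res.x, -res.fun)  # (best exponent, best |r|)

# with low noise the recovered exponent sits near e_true and |r| is high;
# as the base grows, |r| drops and the estimate drifts
for base, (exp, r) in results.items():
    print(f'noise base {base}: exponent {exp:.3f}, |r| {r:.3f}')
```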

Conclusion

Optimization methods are a gold mine for many applications ready to be explored.

With this article, I don't want to teach a new technique; rather, I want to promote experimentation with these effective methods on 'unusual' problems.