To get an idea, let’s visualise this function.

Figure 3. Error function of the first model

As you can guess intuitively from the 3D graph above, this is a convex function.

Optimisation (finding the minimum) of a convex function is much simpler than general mathematical optimisation, because any local minimum of a convex function is also a global minimum.

(A very simple picture: a convex function is shaped like a bowl or the letter “U”, so it has no separate local valleys to get stuck in.) Thanks to this property, the parameters that minimise the function can be found by setting its partial derivative with respect to each parameter to zero and solving the resulting equations, as follows.

Let’s solve our case.
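As a worked sketch of this step, assume the first model has the form y = ax + b (consistent with the result quoted next) and take the three data points (2,4), (5,1), (8,9) mentioned later in the text. The symbol E for the RSS is only a label chosen here.

```latex
\begin{aligned}
E(a,b) &= (2a+b-4)^2 + (5a+b-1)^2 + (8a+b-9)^2 \\
       &= 93a^2 + 30ab + 3b^2 - 170a - 28b + 98, \\
\frac{\partial E}{\partial a} &= 186a + 30b - 170 = 0, \\
\frac{\partial E}{\partial b} &= 30a + 6b - 28 = 0.
\end{aligned}
```

Both conditions are ordinary linear equations in a and b, so the system can be solved by substitution or elimination.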

Solving the equations above, we obtain a = 5/6 and b = 1/2.
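The result can be checked numerically. The sketch below assumes the first model is y = ax + b and uses the data points (2,4), (5,1), (8,9); the function names `rss` and `rss_gradient` are illustrative only. It confirms that both partial derivatives vanish at (5/6, 1/2) and that nearby parameter values give a larger RSS.

```python
from fractions import Fraction

# Data points used throughout the article.
POINTS = [(2, 4), (5, 1), (8, 9)]


def rss(a, b):
    """Residual sum of squares for the assumed first model y = a*x + b."""
    return sum((a * x + b - y) ** 2 for x, y in POINTS)


def rss_gradient(a, b):
    """Partial derivatives of RSS with respect to a and b."""
    d_a = sum(2 * x * (a * x + b - y) for x, y in POINTS)
    d_b = sum(2 * (a * x + b - y) for x, y in POINTS)
    return d_a, d_b


a_opt, b_opt = Fraction(5, 6), Fraction(1, 2)

# Both partial derivatives are exactly zero at the reported solution.
print(rss_gradient(a_opt, b_opt))  # (Fraction(0, 1), Fraction(0, 1))

# Stepping away from the solution in any of these 8 directions increases RSS.
step = Fraction(1, 10)
neighbours = [
    (a_opt + da, b_opt + db)
    for da in (-step, 0, step)
    for db in (-step, 0, step)
    if (da, db) != (0, 0)
]
print(all(rss(a, b) > rss(a_opt, b_opt) for a, b in neighbours))  # True
```

Exact fractions are used instead of floats so that the zero gradient is not blurred by rounding.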

So, our first model (which minimises RSS) is obtained as below.

Figure 4. The first model

Example 2: Simple curvy model

Now, for the same data points, let’s consider another model, shown below.

As you can see, this is no longer a linear function of the input variable x.

However, it is still a linear function of the parameters a and b.

Let’s see how this change affects the model-fitting procedure.

We’ll use the same error function as in the previous example: RSS.
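For reference, here is a sketch of that error function, assuming the second model is y = ax² + b. This form is an assumption: it matches the x² values discussed further on and the parameters reported for this model. E is again just a label for the RSS on the data points (2,4), (5,1), (8,9).

```latex
\begin{aligned}
E(a,b) &= (4a+b-4)^2 + (25a+b-1)^2 + (64a+b-9)^2 \\
       &= 4737a^2 + 186ab + 3b^2 - 1234a - 28b + 98.
\end{aligned}
```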

As seen above, the equation looks very similar to the previous one. (The values of the coefficients are different, but the form of the equation is the same.)

The visualisation is below.

Figure 5. Error function of the second model

The shape also looks similar, and this is still a convex function.

The secret is that when we calculate the errors on the training data, the input variables are given as concrete values (for example, the values of x² are 2², 5² and 8² for our data set (2,4), (5,1), (8,9)).

So no matter how complicated the form of the input variables is (e.g. x, x², sin(x), log(x)), their values enter the error function as plain constants.
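To make this concrete, the sketch below computes the coefficients of the RSS for any model of the assumed form y = a·f(x) + b, where f is an arbitrary function of the input. The helper name `rss_coefficients` and its return layout are hypothetical, chosen only for illustration. Whatever f is, the RSS comes out as a quadratic in a and b with constant coefficients.

```python
import math

# Data points used throughout the article.
POINTS = [(2, 4), (5, 1), (8, 9)]


def rss_coefficients(feature, points=POINTS):
    """Coefficients of RSS(a, b) for a model y = a*feature(x) + b.

    Returns (c_aa, c_ab, c_bb, c_a, c_b, c_0) such that
    RSS = c_aa*a^2 + c_ab*a*b + c_bb*b^2 + c_a*a + c_b*b + c_0.
    """
    # Once the data are known, feature(x) is just a list of numbers.
    phi = [feature(x) for x, _ in points]
    ys = [y for _, y in points]
    c_aa = sum(p * p for p in phi)
    c_ab = 2 * sum(phi)
    c_bb = len(points)
    c_a = -2 * sum(p * y for p, y in zip(phi, ys))
    c_b = -2 * sum(ys)
    c_0 = sum(y * y for y in ys)
    return c_aa, c_ab, c_bb, c_a, c_b, c_0


print(rss_coefficients(lambda x: x))       # (93, 30, 3, -170, -28, 98)
print(rss_coefficients(lambda x: x ** 2))  # (4737, 186, 3, -1234, -28, 98)
print(rss_coefficients(math.sin))          # floats, but the same quadratic form
```

Only the numbers change between feature functions; the shape of the expression in a and b does not.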

Since the error function of the second model is also convex, we can find the optimal parameters by exactly the same procedure as in the previous example.
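As a sketch, again assuming the second model is y = ax² + b and writing E for its RSS on the data points (2,4), (5,1), (8,9), setting both partial derivatives to zero gives:

```latex
\begin{aligned}
\frac{\partial E}{\partial a} &= 9474a + 186b - 1234 = 0, \\
\frac{\partial E}{\partial b} &= 186a + 6b - 28 = 0.
\end{aligned}
```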

Solving the equations above, we obtain a = 61/618 and b = 331/206.

So, our second model is obtained as below.

Figure 6. The second model

Conclusion: Linearity behind linear regression models

The two examples above were solved by exactly the same (and very simple) procedure, even though one is linear in the input variable x and the other is non-linear in x.

The common characteristic of the two models is that both functions are linear in the parameters a and b.

This is the linearity assumed by linear regression models, and it is the key to their mathematical simplicity.
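The shared procedure can be sketched as one small function. It assumes both models have the form y = a·f(x) + b, with f(x) = x for the first and f(x) = x² for the second; the name `fit` and its signature are illustrative, not taken from any library. The only thing that differs between the two calls is the feature function.

```python
from fractions import Fraction

# Data points used throughout the article.
POINTS = [(2, 4), (5, 1), (8, 9)]


def fit(feature, points=POINTS):
    """Minimise RSS for y = a*feature(x) + b by solving dRSS/da = dRSS/db = 0."""
    phi = [Fraction(feature(x)) for x, _ in points]
    ys = [Fraction(y) for _, y in points]
    n = len(points)
    s_p = sum(phi)
    s_pp = sum(p * p for p in phi)
    s_y = sum(ys)
    s_py = sum(p * y for p, y in zip(phi, ys))
    # Stationarity conditions (after dividing both by 2):
    #   s_pp*a + s_p*b = s_py
    #   s_p*a  + n*b   = s_y
    # Solved here with Cramer's rule.
    det = s_pp * n - s_p * s_p
    a = (s_py * n - s_p * s_y) / det
    b = (s_pp * s_y - s_p * s_py) / det
    return a, b


print(fit(lambda x: x))       # (Fraction(5, 6), Fraction(1, 2))
print(fit(lambda x: x ** 2))  # (Fraction(61, 618), Fraction(331, 206))
```

The determinant is non-zero whenever the feature values are not all identical, which holds for both models on this data set.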

We have only seen two very simple models above, but in general, a model’s linearity in its parameters guarantees that its RSS is a convex function.

This is why we can obtain the optimal parameters by setting the partial derivatives to zero and solving a simple system of linear equations.
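A brief sketch of why this holds in general, using notation that is not in the original: write the model as a weighted sum of fixed functions φ_j(x) with parameter vector θ, and collect the values φ_j(x_i) at the data points into a matrix Φ.

```latex
\begin{aligned}
\mathrm{RSS}(\theta) &= \lVert \Phi\theta - y \rVert^2, \qquad \Phi_{ij} = \phi_j(x_i), \\
\nabla \mathrm{RSS}(\theta) &= 2\,\Phi^{\top}(\Phi\theta - y) = 0
  \;\Longleftrightarrow\; \Phi^{\top}\Phi\,\theta = \Phi^{\top}y, \\
\nabla^2 \mathrm{RSS}(\theta) &= 2\,\Phi^{\top}\Phi, \qquad
v^{\top}\bigl(2\,\Phi^{\top}\Phi\bigr)v = 2\,\lVert \Phi v \rVert^2 \ge 0 \quad \text{for all } v.
\end{aligned}
```

The Hessian is positive semidefinite everywhere, so the RSS is convex, and the zero-gradient condition is a linear system in θ.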

And that’s why the linearity matters.
