# LOESS

First of all, think of the red line as an ordered sequence of equally spaced x values, in this case between 0 and 2π.

For each of these values, select an appropriate neighborhood of sampled points, and use them as the training set for a linear regression problem.

With the resulting model, estimate the new value for your point.

Let us explore this idea in a bit more detailThe Idea Behind LOESSThis algorithm estimates the latent function in a point-wise fashion.

For each value of x, we estimate the value of f(x) by using its neighboring sampled (known) values.

This is quite similar to a KNN algorithm, where k, the window size, is a tunable parameter and, in this particular case, will determine the smoothness of the resulting estimate.

In a sense, k is your bias vs.

variance knob.

Large values of k will result in higher bias and lower values will induce higher variance.

The first step is to collect the value of x for which we want to estimate y.

Let’s call these x’ and y’.

By feeding the LOESS algorithm with x’, and using the sampled x and y values, we will obtain an estimate y’.

In this sense, LOESS is a non-parametric algorithm that must use all the dataset for estimation.

Now that we have x’, we must find its k nearest neighbors using a simple Euclidean distance.

Let’s call the resulting ordered set D.

The next step converts the set D of k distances into an ordered set W containing weights that will be later used in the linear regression process.

These weights are calculated using a specialized weight function that assigns importance to each of the k neighbors of x according to its distance to x’.

The Weight FunctionDistance weights are calculated using the tri-cubic function:Tri-cubic weighting functionThis function looks like a hat and has positive values only between -1 and 1.

Outside of this interval, the function is zero.

Here is what the function looks like:Graph of the tri-cubic weight functionAs this function only has positive results for -1 < x < 1, we must normalize the distance by dividing it by the maximum value observed in D.

More concretely,Weighting functionHere, we denote d(x, x’) as the distance between x, one of the k nearest neighbors, and x’.

The effect of normalization is that larger distances will be associated with lower weights.

At the very extreme, the point corresponding to the maximum distance will have a weight of zero, and the point at zero distance will have the highest possible weight — one.

That is how the “locality” effect is achieved, by assigning higher importance to the training data that is closest to where we want the prediction to be calculated.

As a side note, you may find that this function has a striking similarity to the tri-cubic kernel function.

The difference in scale (70/81) between these functions relates to the requirement that a kernel function must integrate to one over its domain, while here that requirement is relaxed.

We are now ready to calculate the estimate using a simple weighted linear regression that is trained with the x values from D, and the corresponding y values.

Linear RegressionLinear regression is the bread-and-butter of supervised machine learning methods.

Odds are, you started your ML journey learning the innards of this method, probably trying to figure out the sale price for households in Portland, given their physical features.

Or maybe it was something else entirely, but you know the drill, don’t you?It so happens that a specialized version of linear regression, weighted linear regression, is at the heart of LOESS.

For every point that we set out to estimate (x’), the LOESS algorithm must set up a linear regression model that will calculate the corresponding output (y’), using the k nearest neighbors of x’ and a set of weights that rates their importance.

The local linear regression usually models low-dimensional polynomials, a line or a quadratic.

The first-degree regression equationThe second-degree regression equationWeighted linear regression is a known problem and is abundantly documented online.

Due to the typical low dimensionality of the problems that will be tackled, we will resort to the closed-form normal equations for parameter estimation.

In the unweighted case, these equations are:Normal equations for linear regressionWere beta is the vector of linear parameters, X is the matrix containing all x observations, arranged like so:The X matrixConcretely, this matrix models a sample with n dimensions and m observations.

Note that I am including the intercept term in the matrix through the first column.

For the case when we are modeling a second-degree polynomial, this matrix is actually:Second-degree X matrixOnce we have the beta vector, new values of y can be calculated using the following equation:Extending this concept to using weights is actually quite simple and the normal equation just needs an extra term:Weighted normal equationHere, the weight matrix W has all the calculated weights in the diagonal with all other elements set to zero.

Unfortunately, as you will see in the implemented Python code, the matrix approach can be a bit slow.

If you stick to the first-degree model, an alternative approach can be taken using simpler math:Looks complex but the implementation is far simpler through the use of internal products of vectors to eliminate explicit sums.

Notation note: d stands for the number of items in D, which is actually k.

On to the code.

Python LibrariesYou can find an implementation of this smoother in the StatsModels Python package.

By reading through the method documentation, you see that lowess function returns an array with the same dimension as the two input arrays (x and y).

This means that only the observed values are smoothed so if you need any other values in between, you will have to somehow interpolate them.

But this does not have to be this way.

Python ImplementationFor this article, I developed a new implementation based on NumPy that leverages its vectorization features, and the code can be found in this GitHub repository.

The code was developed with vectorization in mind and there is only one loop in the function that determines the indexes of the closest values.

Let us step through the code and see how it works.

The tri-cubic weighting function is fully vectorized and it processes arrays of x values.

First, the output array y is created with the same dimensions as the input array x.

Next, an indexing array is created to enforce the function’s domain and finally, the function itself is calculated.

Note that the indexing array is used on both the input and output arrays.

My first approach was to vectorize the code using Numba, but then I realized that this approach had the same performance, and did away with the unnecessary compilation.

Upon initialization, both input arrays must be normalized to avoid problems of loss of significance (aka, catastrophic cancellation).

This is done quite simply with a rescaling to the interval between zero and one.

Weights are calculated from the array of distances with the help of an indexing array, that contains the indexes of the minimal-distance window.

This indexing array is calculated in the next function:In order to calculate the range with the minimum total distance to x’, we start by determining the index of the minimum distance within the distances array.

Knowing that the indexes must be consecutive, we can use this initial index as the root of a growing list of indexes.

The tests at the top of the function just handle the edge cases when the minimum index is at the extremes of the distances array.

The following loop grows the list of indices, starting from the index of the minimal distance, adding items left and right as needed and keeping the list naturally sorted, inserting to the left and appending to the right.

Note that the number of loops is limited to k-1.

Now, we get to the heart of the code.

The function that estimates f(x) can be used in two modes: matrix or statistical.

In matrix mode, you can specify a polynomial degree but will have lower performance.

The statistical code is faster but only models lines.

The function starts by normalizing the input x value and calculating its distance to all the training values.

The array of distances has the same dimension as the training data.

Next, the minimum distance range is found and the corresponding weights calculated.

Note that the array of weights has k (the window size) items.

Finally, the regression is trained and the estimated value for f(x) is calculated using either of the methods described above.

Please note that if you want to use a polynomial regression the code will use “matrix mode”.

Finally, here’s a sample of how to use the code (data values are taken from NIST):Please note that you can provide values of x other than the ones in the training data.

Here’s an example of a smoothing function on the same data as the first chart’s:Interpolated function in blackYou can play with this chart by using the companion notebook in the GitHub repo.

ConclusionWe have gone through the rationale for using the LOESS local regression model and lifted the veil on how it works.

A Python implementation was developed and presented making heavy use of the NumPy library and its vectorization feature.

Thank you for your time!NotesI found this definition in [1].

The author makes no mention of the “LOWESS” term.

References[1] Gareth, J.

Witten, D.

Hastie, T.

Tibshirani, R.

(2013).

An Introduction to Statistical Learning with Applications in R.

New York: Springer[2] Alpaydın, E.

(2014).

Introduction to machine learning.

3rd ed.

Cambridge, Massachusetts: The MIT Press.

[3] Starmer, J.

(2017).

StatQuest: Fitting a curve to data, aka lowess, aka loess, YouTube.

.