Dealing with highly correlated columns in ML models

It is commonly accepted that the square of correlation is a good approximation for how well a column is described by another column.

What would happen if I squared all of those correlation matrices and added them up? Each group has 6 columns, so if all the columns were independent (and therefore had 0 correlation with each other), the sum of squares would be 6, as each column has a correlation of 1 with itself.

If they were all identical, then all correlations would be 1, and the sum would be 36.

The average across each column in the first case is 1, and in the second case it is 6.

It intuitively makes sense to me that there are 6 columns’ worth of information in the first case, and 6/6 = 1 column’s worth of information in the second case.
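The two extremes are easy to check numerically. Here's a minimal sketch (using NumPy; the 6-column group is hypothetical):

```python
import numpy as np

n = 6  # columns per group

# Extreme 1: fully independent columns -> the correlation matrix is the identity
independent = np.eye(n)
# Extreme 2: identical columns -> every pairwise correlation is 1
identical = np.ones((n, n))

effective = []
for corr in (independent, identical):
    total = (corr ** 2).sum()          # sum of squared correlations
    avg = total / n                    # average per column
    effective.append(n / avg)          # columns 'worth' of information
    print(total, n / avg)

# prints 6.0 6.0, then 36.0 1.0
```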

Finding this sum for each of the above 3 groups and dividing by 6 each time, we get 3.21 for the pay_ group, 4.94 for bill_amt, and 1.19 for pay_amt. To get the non-redundant number of columns in each group, we divide the total number of columns in the group by this number.

This suggests that the pay_ group actually has 6/3.21 = 1.87 columns ‘worth’ of information, the bill_amt group has 6/4.94 = 1.21 columns worth, and the pay_amt group has 6/1.19 = 5.05 columns worth.

This intuitively makes sense, as the bill_amt columns were almost identical, whereas the pay_amt columns were not.

I also asked myself what would happen if these groups were correlated with each other, e.g. ‘bill amt’ and ‘pay amt’, and then I arrived at an elegant solution.

Solution

My solution is to come up with a redundancy score for each column, which I will call C²: the sum of the squares of the correlations between that column and every column in the data, including itself.

Here are the top and bottom columns by the C² measure. Unsurprisingly, the bill_amt and pay_ columns are at the top. Note how C² never goes below 1, as discussed earlier, and is nearly 6 for the bill_amt columns (very redundant!).
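Computing C² takes one line on top of a correlation matrix. A minimal sketch, assuming pandas and using synthetic stand-in data (the column names echo the dataset's groups but the values are made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: two nearly identical columns plus one independent one
rng = np.random.default_rng(1)
base = rng.normal(size=500)
df = pd.DataFrame({
    "bill_amt1": base + rng.normal(scale=0.1, size=500),
    "bill_amt2": base + rng.normal(scale=0.1, size=500),
    "pay_amt1": rng.normal(size=500),
})

# C² for each column: sum of squared correlations with every column,
# including the column's correlation of 1 with itself
c2 = (df.corr() ** 2).sum()
print(c2.sort_values(ascending=False))
```

The near-duplicate bill_amt columns get a C² close to 2, while the independent pay_amt1 stays close to the floor of 1.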

Applying the solution: Ridge Regression

If you are doing a ridge regression, there is a penalty proportional to the square of each column's coefficient. E.g. if your model is trying to predict X using A, and you say X = 2A, the penalty will be (2²)λ, which is 4λ.

λ is a hyperparameter set before doing the ridge regression.

Now if you create a column B=A, your model will be X= A+B, which will have a penalty of (2*1²)λ, or 2λ, half as much as before.

This means that redundant columns are under-penalised by ridge regressions.
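This under-penalisation is easy to demonstrate empirically. A sketch assuming scikit-learn, with synthetic noiseless data and a tiny α so the fit is essentially unconstrained:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
a = rng.normal(size=200)
y = 2 * a  # the relationship to recover is X = 2A

# Single column: coefficient ~2, so the penalty term is ~(2²)λ = 4λ
single = Ridge(alpha=1e-6, fit_intercept=False).fit(a.reshape(-1, 1), y)
print(single.coef_)

# Duplicate the column (B = A): the coefficient splits into ~1 and ~1,
# so the penalty term drops to ~(1² + 1²)λ = 2λ
dup = Ridge(alpha=1e-6, fit_intercept=False).fit(np.column_stack([a, a]), y)
print(dup.coef_, (dup.coef_ ** 2).sum())
```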

This problem can be fixed by dividing the values in the columns by sqrt(C²).

In the above case, the correlation between A and B is 1, so the C² of each of the columns will be 2.

If we divide each of the columns by √2, we’ll get the equation X = √2 A + √2 B, giving us a penalty of (2·(√2)²)λ, or 4λ, as before.

This means that our model is not biased by adding in the new column.
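Continuing the same synthetic example (again a sketch assuming scikit-learn), dividing the duplicated columns by √C² = √2 restores the original penalty:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
a = rng.normal(size=200)
y = 2 * a
c2 = 2.0  # A and B = A are perfectly correlated, so each has C² = 2

# Divide the duplicated columns by sqrt(C²) before fitting
X = np.column_stack([a, a]) / np.sqrt(c2)
model = Ridge(alpha=1e-6, fit_intercept=False).fit(X, y)
print(model.coef_)               # ~[√2, √2]
print((model.coef_ ** 2).sum())  # ~4: the same penalty term as with A alone
```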

Disclaimer: Remember to divide after using a scaler, not before!

Applying the solution: K-Nearest Neighbours

The K-Nearest Neighbours (KNN) algorithm attempts to guess the target variable by looking at similar data points.

The number of similar data points it looks at is ‘K’, and it determines similarity by minimising ‘distance’.

A common way of measuring distance is Euclidean distance: the straight-line distance between two points.

This is calculated by squaring the differences in each column, adding them up, and square rooting the sum.

For example, if the distance in column A is 3, and column B is 4, the Euclidean distance is sqrt(3² + 4²)= sqrt(25)= 5.

In the above example, if we define a new column X=A, the distance in column X will also be 3, and the new Euclidean distance will be sqrt(2*3² + 4²) = sqrt(34).

This creates an undue bias towards the redundant information.

This can again be fixed by dividing both columns by sqrt(C²).

In the above example, the correlation between A and X is 1, so if B is independent from A, their C² values would be 2.

So after dividing by sqrt(2), the distance will be 3/sqrt(2) in each column.

The new Euclidean distance will be sqrt(2*(3/√2)² + 4²) = sqrt(2*4.5 + 4²) = sqrt(25), the same as before!

Note: If you are using the Minkowski distance of order p, you divide the columns by (C²)^(1/p). The Euclidean case is p=2, which gives the sqrt(C²) above; if you're measuring Manhattan distance, p=1, so you divide the columns by C².
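A quick numeric check of the distances above (NumPy; the per-column differences and C² values are the ones from the worked example):

```python
import numpy as np

# Per-column differences between two points: 3 in A, 3 in X (= A), 4 in B
diff = np.array([3.0, 3.0, 4.0])
c2 = np.array([2.0, 2.0, 1.0])  # C² scores: A and X are perfectly correlated

# Uncorrected, the duplicate inflates the Euclidean distance to sqrt(34)
euclid_biased = np.sqrt((diff ** 2).sum())
print(euclid_biased)

# Divide each column's difference by (C²)^(1/p); for Euclidean distance, p = 2
p = 2
scaled = diff / c2 ** (1 / p)
euclid_fixed = ((np.abs(scaled) ** p).sum()) ** (1 / p)
print(euclid_fixed)  # ~5.0, the same as before the duplicate was added
```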

Disclaimer: Remember to divide after using a scaler, not before!

Applying the solution: Random Forests

A random forest is a collection of randomly generated decision trees that ‘vote’ on the solution to a machine learning problem. Data is randomly selected via bootstrapping, and random ‘features’ (columns) are selected at each node.

If the columns have redundant information, then random selection will bias us towards the redundant information.

This can actually be resolved quite easily, by giving each column a weight of 1/C² when picking them.

When picking one out of A and B, the odds are 50/50.

If we add X=A, then there is a 2/3 chance we get the information from A.

Since A and X have a correlation of 1, they both have a C² score of 2.

Dividing by C² we get weights of 0.5 for A, 0.5 for X, and 1 for B.

This means that there’s a 25% chance we pick A, 25% chance we pick X, and 50% chance we pick B — back to 50/50.
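The weighted picking can be sketched as follows (NumPy; note that off-the-shelf random forest implementations don't generally expose per-feature sampling weights, so this only illustrates the probabilities):

```python
import numpy as np

cols = np.array(["A", "X", "B"])
c2 = np.array([2.0, 2.0, 1.0])   # A and X are duplicates; B is independent

# Weight each column by 1/C², then normalise into probabilities
weights = (1 / c2) / (1 / c2).sum()
print(dict(zip(cols, weights)))  # A: 0.25, X: 0.25, B: 0.5

# Simulate feature picks: 'information from A' means drawing either A or X
rng = np.random.default_rng(0)
picks = rng.choice(cols, size=100_000, p=weights)
frac_a = np.mean((picks == "A") | (picks == "X"))
print(frac_a)  # ~0.5: back to 50/50 against B
```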

Conclusions

I feel what I have discussed above is a smooth solution to some of the problems that arise from correlated columns in machine learning models.

I tried to look on Google for people who tried similar things and I could not find anything.

If there is either a flaw, or you know of someone who has done so before, please inform me, so I can make the necessary corrections.
