In tis case we can use Spectral Graph Theory methods to create low dimensional embedding of the categorical features.

The idea came from spectral word embedding, spectral clustering and spectral dimensionality reduction algorithms.

If you can define a similarity measure between different values of the categorical features, we can use spectral analysis methods to find the low dimensional representation of the categorical feature.

From the similarity function (or kernel function) we can construct an Adjacency matrix, which is a symmetric matrix, where the ij element is the value of the kernel function between category values i and j:It is very important that I only need a Kernel function, not a high-dimensional representation.

This means that 1-hot encoding step is not necessary here.

Also for the kernel-base machine learning methods, the categorical variable encoding step is not necessary as well, because what matters is the kernel function between two points, which can be constructed using the individual kernel functions.

Once the adjacency matrix is constructed, we can construct a degree matrix:Here δ is the Kronecker delta symbol.

The Laplacian matrix is the difference between the two:And the normalize Laplacian matrix is defined as:Following the Spectral Graph theory, we proceed with eigendecomposition of the normalized Laplacian matrix.

The number of zero eigenvalues correspond to the number of connected components.

In our case, let’s assume that our categorical feature has two sets of values that are completely dissimilar.

This means that the kernel function K(i,j) is zero if i and j belong to different groups.

In this case we will have two zero eigenvalues of the normalized Laplacian matrix.

If there is only one connected component, we will have only one zero eigenvalue.

Normally it is uninformative and is dropped to prevent multicollinearity of features.

However we can keep it if we are planning to use tree-based models.

The lower eigenvalues correspond to “smooth” eigenvectors (or modes), that are following the similarity function more closely.

We want to keep only these eigenvectors and drop the eigenvectors with higher eigenvalues, because they are more likely represent noise.

It is very common to look for a gap in the matrix spectrum and pick the eigenvalues below the gap.

The resulting truncated eigenvectors can be normalized and represent embeddings of the categorical feature values.

As an example, let’s consider the Day of Week.

1-hot encoding assumes every day is similar to any other day (K(i,j)=1 ).

This is not a likely assumption, because we know that days of the week are different.

For example, the bar attendance spikes on Fridays and Saturdays (at least in USA) because the following day is a weekend.

Label encoding is also incorrect, because it will make the “distance” between Monday and Wednesday twice higher than between Monday and Tuesday.

And the “distance” between Sunday and Monday will be six times higher, even though the days are next to each other.

By the way, the label encoding corresponds to the kernel K(i,j)=exp(−γ|i−j|)We will consider an example, where weekdays are similar to each other, but differ a lot from the weekends.

array([[ 0, 10, 9, 8, 5, 2, 1], [10, 0, 10, 9, 5, 2, 1], [ 9, 10, 0, 10, 8, 2, 1], [ 8, 9, 10, 0, 10, 2, 1], [ 5, 5, 8, 10, 0, 5, 3], [ 2, 2, 2, 2, 5, 0, 10], [ 1, 1, 1, 1, 3, 10, 0]])array([[ 1.

, -0.

27788501, -0.

24053512, -0.

21380899, -0.

14085904, -0.

07049074, -0.

040996 ], [-0.

27788501, 1.

, -0.

25993762, -0.

23394386, -0.

13699916, -0.

06855912, -0.

03987261], [-0.

24053512, -0.

25993762, 1.

, -0.

25 , -0.

21081851, -0.

06593805, -0.

03834825], [-0.

21380899, -0.

23394386, -0.

25 , 1.

, -0.

26352314, -0.

06593805, -0.

03834825], [-0.

14085904, -0.

13699916, -0.

21081851, -0.

26352314, 1.

, -0.

17376201, -0.

12126781], [-0.

07049074, -0.

06855912, -0.

06593805, -0.

06593805, -0.

17376201, 1.

, -0.

50572174], [-0.

040996 , -0.

03987261, -0.

03834825, -0.

03834825, -0.

12126781, -0.

50572174, 1.

]])array([0.

, 0.

56794799, 1.

50908645, 1.

08959831, 1.

3053149 , 1.

25586378, 1.

27218858])Notice, that the eigenvalues are not ordered here.

Let’s plot the eigenvalues, ignoring the uninformative zero.

We can see a pretty substantial gap between the first eigenvalue and the rest of the eigenvalues.

If this does not give enough model performance, you can include the second eigenvalue, because the gap between it and the higher eigenvalues is also quite substantial.

Let’s print all eigenvectors:array([[ 0.

39180195, 0.

22866879, 0.

01917247, -0.

45504284, 0.

12372711, -0.

41844908, -0.

62957304], [ 0.

40284079, 0.

24416078, 0.

01947223, -0.

4281388 , -0.

53910465, -0.

01139734, 0.

55105271], [ 0.

41885391, 0.

23795901, -0.

0032909 , -0.

00102155, 0.

24759021, 0.

82656956, -0.

15299308], [ 0.

41885391, 0.

21778112, -0.

01536901, 0.

36430356, 0.

56996731, -0.

36551902, 0.

43094387], [ 0.

39735971, -0.

02474713, 0.

07869969, 0.

66992782, -0.

54148697, -0.

08518483, -0.

29331097], [ 0.

3176117 , -0.

61238751, -0.

71702346, -0.

09280736, 0.

02933834, 0.

00752668, 0.

02123917], [ 0.

27305934, -0.

63907128, 0.

69187421, -0.

13963728, 0.

11758088, 0.

02521838, 0.

06615712]])Look at the second eigenvector.

The weekend values have a different size than the weekdays and Friday is close to zero.

This proves the transitional role of Friday, that, being a day of the week, is also the beginning of the weekend.

If we are going to pick two lowest non-zero eigenvalues, our categorical feature encoding will result in these category vectors:array([[ 0.

22866879, -0.

45504284], [ 0.

24416078, -0.

4281388 ], [ 0.

23795901, -0.

00102155], [ 0.

21778112, 0.

36430356], [-0.

02474713, 0.

66992782], [-0.

61238751, -0.

09280736], [-0.

63907128, -0.

13963728]])In the plot above we see that Monday and Tuesday, and also Saturday and Sunday are clustered close together, while Wednesday, Thursday and Friday are far apart.

Learning the kernel functionIn the previous example we assumed that the similarity function is given.

Sometimes this is the case, where it can be defined based on the business rules.

However it may be possible to learn it from data.

One of the ways to compute the Kernel is using Kullback-Leibler Divergence:Where D is Symmetrised KL divergence:Here pi is a probability of the data given the category value i:The idea is to estimate the data distribution (including the target variable, but excluding the categorical variable) for each value of the categorical variable.

If for two values the distributions are similar, then the divergence will be small and the similarity value will be large.

Note that γ is a hyperparameter and will have to be tunedTo try this approach will will use liquor sales data set.

To keep the file small I removed some columns and aggregated the data.

Since we care about sales, let’s encode the day of week using the information from the sales column Let’s check the histogram first:sns.

distplot(liq.

sales, kde=False);We see that the distribution is very skewed, so let’s try to use log of sales columns insteadsns.

distplot(np.

log10(1+liq.

sales), kde=False);This is much better.

So we will use a log for our distributionliq["log_sales"] = np.

log10(1+liq.

sales)Here we will follow this blog for computation of the Kullback-Leibler divergence.

Also note, that since there are no liquor sales on Sunday, we consider only six days in a weekarray([[0.

00000000e+00, 8.

77075038e-02, 4.

67563784e-02, 4.

73455185e-02, 4.

36580887e-02, 1.

10008520e-01], [8.

77075038e-02, 0.

00000000e+00, 6.

33458241e-03, 6.

12091647e-03, 7.

54387432e-03, 1.

24807509e-03], [4.

67563784e-02, 6.

33458241e-03, 0.

00000000e+00, 1.

83170834e-06, 5.

27510292e-05, 1.

32091396e-02], [4.

73455185e-02, 6.

12091647e-03, 1.

83170834e-06, 0.

00000000e+00, 7.

42423681e-05, 1.

28996949e-02], [4.

36580887e-02, 7.

54387432e-03, 5.

27510292e-05, 7.

42423681e-05, 0.

00000000e+00, 1.

49325072e-02], [1.

10008520e-01, 1.

24807509e-03, 1.

32091396e-02, 1.

28996949e-02, 1.

49325072e-02, 0.

00000000e+00]])As we already mentioned, the hyperparameter γ has to be tuned.

Here we just pick the value that will give a plausible resultgamma = 20kernel = np.

exp(-gamma * kl_matrix)np.

fill_diagonal(kernel, 0)kernelarray([[0.

, 0.

17305426, 0.

39253579, 0.

38793776, 0.

41762901, 0.

11078428], [0.

17305426, 0.

, 0.

88100529, 0.

88477816, 0.

85995305, 0.

97534746], [0.

39253579, 0.

88100529, 0.

, 0.

99996337, 0.

99894554, 0.

76783317], [0.

38793776, 0.

88477816, 0.

99996337, 0.

, 0.

99851625, 0.

77259995], [0.

41762901, 0.

85995305, 0.

99894554, 0.

99851625, 0.

, 0.

74181889], [0.

11078428, 0.

97534746, 0.

76783317, 0.

77259995, 0.

74181889, 0.

]])norm_lap = normalized_laplacian(kernel)sz, sv = np.

linalg.

eig(norm_lap)szarray([1.

11022302e-16, 9.

99583797e-01, 1.

22897829e+00, 1.

27538999e+00, 1.

24864532e+00, 1.

24740260e+00])In [18]:sns.

stripplot(data=sz[1:], jitter=False, );Ignoring the zero eigenvalue, we can see that there is a bigger gap between the first eigenvalue and the rest of the eigenvalues, even though the values are all in the range between 1 and 1.

3.

Ultimately the number of eigenvectors to use is another hyperparameter, that should be optimized on a supervised learning task.

The Category field is another candidate to do spectral analysis, and is, probably, a better choice since it has more unique valueslen(liq.

Category.

unique())107array([[0.

00000000e+00, 1.

01321384e-02, 2.

38664557e-01, .

, 5.

83930416e-02, 2.

05621708e+01, 4.

44786939e-01], [1.

01321384e-02, 0.

00000000e+00, 1.

50225839e-01, .

, 1.

17178087e-01, 2.

24843754e+01, 5.

89215704e-01], [2.

38664557e-01, 1.

50225839e-01, 0.

00000000e+00, .

, 5.

33952956e-01, 2.

95549456e+01, 1.

33572924e+00], .

, [5.

83930416e-02, 1.

17178087e-01, 5.

33952956e-01, .

, 0.

00000000e+00, 1.

59700549e+01, 1.

80637715e-01], [2.

05621708e+01, 2.

24843754e+01, 2.

95549456e+01, .

, 1.

59700549e+01, 0.

00000000e+00, 8.

58405693e+00], [4.

44786939e-01, 5.

89215704e-01, 1.

33572924e+00, .

, 1.

80637715e-01, 8.

58405693e+00, 0.

00000000e+00]])plot_eigenvalues(100);We can see, that a lot of eigenvalues are grouped around the 1.

1 mark.

The eigenvalues that are below that cluster can be used for encoding the Category feature.

Please also note that this method is highly sensitive on selection of hyperparameter γ .

For illustration let me pick a higher and a lower gammaplot_eigenvalues(7000);plot_eigenvalues(10)Conclusion and next stepsWe presented a way to encode the categorical features as a low dimensional vector that preserves most of the feature similarity information.

For this we use methods of Spectral analysis on the values of the categorical feature.

In order to find the kernel function we can either use heuristics, or learn it using a variety of methods, for example, using Kullback–Leibler divergence of the data distribution conditional on the category value.

To select the subset of the eigenvectors we used gap analysis, but what we really need is to validate this methods by analyzing a variety of data sets and both classification and regression problems.

We also need to compare it with other encoding methods, for example, entity embedding using Neural Networks.

The kernel function we used can also include the information about category frequency, which will help us deal with high information, but low frequency values.

.