How to Apply Self-Supervision to Tabular Data: Introducing dfencoder

Denoising autoencoders for the everyday data scientist.
Michael Klear · Jun 26

In 2016, Facebook’s Chief AI Scientist Yann LeCun made an analogy:

“If intelligence is kind of a cake… unsupervised learning is the génoise, the bulk of the cake.” — Yann LeCun, “Deep Learning and the Future of AI”, 2016

He elaborates on this analogy, comparing supervised learning to the icing on the cake and reinforcement learning to the cherry on top.
In 2019, he updated his analogy, amending unsupervised learning to self-supervised learning.
Self-Supervised Learning

Unsupervised learning is an old and well-understood problem in machine learning; LeCun’s choice to replace it as the star of his cake analogy is not a decision he made lightly!

If you dive into the definition of self-supervised learning, you’ll begin to see that it’s really just an approach to unsupervised learning.
Since many of the breakthroughs in machine learning this decade have been based on supervised learning techniques, successes in unsupervised problems tend to emerge when researchers re-frame an unsupervised problem as a supervised problem.
Specifically, in self-supervised learning, we find a clever way to generate labels without human annotators.
An easy example is a technique called next-step prediction.
Given a sequence (of words or video frames, for example), a model can predict the next step.
Since we already have sequences, there’s no need for human annotators to create labels; we can just truncate the sequence at step t-1 and use step t as a “label” (or target) to optimize our supervised learning algorithm.
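This labeling scheme is simple enough to sketch in a few lines of Python (the helper name below is illustrative, not from any particular library):

```python
# Next-step prediction: build self-supervised (input, target) pairs
# from raw sequences -- no human annotators required.
def next_step_pairs(sequence):
    """Truncate at step t-1 for the input; use step t as the target."""
    return [(sequence[:t], sequence[t]) for t in range(1, len(sequence))]

pairs = next_step_pairs(["the", "cat", "sat"])
# each pair is (prefix, next token), e.g. (["the", "cat"], "sat")
```

Each prefix becomes a training input and the following token its label, so every raw sequence yields several labeled examples for free.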
Next-step prediction is a simple example, but there is a whole suite of tricks within the exploding realm of self-supervised techniques.
Denoising Autoencoders: An Old Trick for New… Dogs?

You can’t teach an old dog new tricks, but denoising autoencoders have been around for a long time.
Yann LeCun himself introduced the concept of denoising autoencoders (DAEs) in 1987, and his work is referenced in ongoing research in this area from the 2000s and 2010s.
The idea is related to the classic autoencoder, but with a fun twist: instead of the classic setup where the model input is equal to the supervised target, we corrupt, or add noise to, each example before feeding it into the model.
The target is the un-corrupted version of the example.
Visual example: We can add noise to an MNIST image and use the un-altered version as a target to train a neural network.
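That corruption step can be sketched in a few lines, assuming additive Gaussian pixel noise on a normalized image (one common choice; masking noise is another):

```python
import numpy as np

def corrupt(image, noise_std=0.3, seed=0):
    """Add Gaussian noise to a [0, 1]-valued image; the clean image
    stays the training target."""
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep pixel values valid

clean = np.zeros((28, 28))  # stand-in for a normalized MNIST digit
noisy = corrupt(clean)      # model input
# training pair for the DAE: input = noisy, target = clean
```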
The Manifold Interpretation

In the context of unsupervised learning, the task of the unsupervised model is to learn the distribution of examples in the feature space; this distribution can be understood as a manifold in the feature space.
By corrupting the input, we’re taking examples and artificially “pushing” them away from the manifold.
By using the unaltered input example as a target, we are able to train our model to approximate a function that projects points in the feature space onto the feature distribution manifold.
The manifold interpretation of the denoising task.
We’re looking at a simple feature space, where the “true” distribution of examples lies close to the line (manifold).
The circle on the left shows the result of corrupting an input.
The lines on the right show the learned function of projecting points onto the manifold (source).
For any readers who are struggling to grasp the manifold concept, this is just a way of saying that corrupted inputs don’t look like examples “in the wild,” but their targets do.
The task of learning to make corrupted inputs “look like” true inputs gives our model a way of learning the distribution itself.
In this way, the denoising task is a way to perform the unsupervised learning task of learning the distribution of examples in the feature space.
The benefit is that we can train with example/target pairs, which opens the door to using supervised learning techniques.
That makes this a self-supervised learning scheme.
This is one of the oldest self-supervised schemes around!

Contrast to Classic Autoencoders

A “vanilla” autoencoder does not corrupt inputs.
The input to the model is identical to the target.
These models learn to approximate the identity function.
Because the identity function is easy to learn, classic autoencoders have a specific requirement: at least one of the internal representations of the model input must have fewer dimensions than the input.
In this way, the task is really “compression”: how can we encode the input vector in a way that preserves information but reduces dimensionality? The autoencoder figures out how to answer that question.
Classic “vanilla” autoencoder setup; a “bottleneck” layer (z) is required to make the identity function non-trivial (source).
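The setup in the figure can be sketched in PyTorch (which dfencoder also uses as its backend); the layer sizes here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VanillaAutoencoder(nn.Module):
    """Classic autoencoder: the bottleneck z has fewer dimensions than
    the input, and the input itself serves as the target."""
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, bottleneck_dim))   # compress down to z
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim))        # reconstruct from z

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.rand(8, 784)
recon = VanillaAutoencoder()(x)
loss = nn.MSELoss()(recon, x)  # the target is the input itself
```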
The DAE is not truly an autoencoder, as it does not learn the identity function.
It actually learns a function that projects points onto the manifold described earlier.
This removes the need for a bottleneck layer in the DAE.
A DAE does not need a low-dimensional “bottleneck” layer.
Copyright by Kirill Eremenko (Deep Learning A-Z™: Hands-On Artificial Neural Networks)

Representation Learning

Deep neural networks are so powerful because of their ability to learn latent features, or representations, of examples. What does that mean?

Here’s a simple, contrived example.
Imagine we are a car insurance company.
We have a table of applications with features describing the cars that our applicants want to insure.
Here is a snippet from that table:

Our contrived example dataset.
One example of a latent feature here is “wealth.” Although we don’t know how much money the applicants have, and we can’t be sure using just these features, it is likely that the applicant with the Ferrari is wealthier than the applicant driving the Honda Civic.
This is the sort of thing an artificial neural network can figure out — without ever explicitly introducing the concept of wealth to the model.
Vanilla autoencoders are a good idea, but the requirement for a bottleneck layer forces our network to do some summarizing.
This tempers the model’s ability to do good representation learning.
That’s why the DAE is such a good idea! Instead of compressing our examples, we can open them up; we can expand each example into a higher-dimensional space that includes latent features, explicitly expressing information learned from the entire dataset in each example.
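An over-complete DAE along these lines might look like the following sketch (the dimensions are illustrative assumptions, not dfencoder’s defaults):

```python
import torch
import torch.nn as nn

# A DAE that expands rather than compresses: hidden layers are *wider*
# than the input, giving the model room to express latent features.
dae = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),   # open the example up: 10 -> 64 dims
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),              # reconstruct the clean input
)

clean = torch.rand(32, 10)                      # a batch of clean rows
noisy = clean + 0.1 * torch.randn_like(clean)   # corruption step
loss = nn.MSELoss()(dae(noisy), clean)          # target is the clean row
```

No bottleneck is needed because the denoising task itself keeps the model from collapsing into the identity function.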
Why Tabular Data?

These techniques work really well in research, but we rarely ever see tabular data in papers.
I want my readers to start considering using DAE for tabular data.
This is because in the mundane, ordinary, day-to-day grind of data scientists, tabular data is what we have.
Data-driven businesses are packed with huge distributed data storage systems, all stuffed with millions of rows and columns of tabular data.
Though underrepresented in ML research, the vast majority of our information is stored in tables.
DAE For Tabular Data: A Success Story

In 2017, a Kaggle competition winner revealed his winning strategy: representation learning with a DAE.
This was a completely tabular dataset.
Say what you will about DAE for tabular data, but the proof is in the pudding.
Michael Jahrer won the contest using a DAE.
Noise for Tables

The challenge with tabular data is the fact that each column represents its own unique distribution.
We have categories, numbers, ranks, binary values, etc., all mashed into the same example. This poses a significant challenge for applying a DAE: what kind of noise do we use?

Some of the original research on DAEs corrupts input values by setting them to zero.
For categorical columns, “zero” doesn’t make sense.
Do we just randomly set some values to the category encoded as zero? That doesn’t seem appropriate.
Swap Noise

The key to Michael’s winning solution was a type of noise that he calls “swap noise.” It’s a really simple idea that’s uniquely suited to tabular data.
Instead of setting values to zero or adding some Gaussian noise to them, we’ll just randomly pick some cells in our dataframe — and replace their values with values from the same column but randomly sampled rows.
This provides a computationally cheap way to sample values from the distribution of a column while avoiding the need to actually model these distributions.
The noise comes with a parameter: how likely is it that we swap a given value? The Kaggle-winning recommendation:

“15% swapNoise is a good start value.” — Michael Jahrer, Porto Seguro Safe Driver Competition Winner

That is, 15% of the cells in a table should be randomly replaced with values from their own columns. Start here and tune the knob to see what works best for your dataset.
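A minimal pandas sketch of swap noise, using that 15% recommendation as the default (an illustration under those assumptions, not dfencoder’s actual implementation):

```python
import numpy as np
import pandas as pd

def swap_noise(df, likelihood=0.15, seed=None):
    """Return a copy of df where each cell is, with probability
    `likelihood`, replaced by a value from a randomly sampled row
    of the same column."""
    rng = np.random.default_rng(seed)
    corrupted = df.copy()
    mask = rng.random(df.shape) < likelihood       # which cells to swap
    for j in range(df.shape[1]):
        rows = np.where(mask[:, j])[0]             # cells to corrupt
        donors = rng.integers(0, len(df), size=len(rows))  # donor rows
        corrupted.iloc[rows, j] = df.iloc[donors, j].to_numpy()
    return corrupted

df = pd.DataFrame({"make": ["Honda", "Ferrari", "Toyota", "Ford"],
                   "year": [2005, 2017, 2010, 1999]})
noisy = swap_noise(df, likelihood=0.5, seed=0)
# every corrupted value still comes from its own column's distribution
```

Because donor values are drawn from the column itself, this works identically for categorical and numeric columns, with no need to model either distribution.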
dfencoder — DAE for Tabular Data

I wanted to try this out for myself. I thought, “wouldn’t it be nice if a pandas dataframe had a method to apply swap noise?”

It doesn’t really make sense for pandas itself to have this feature, but I would find it helpful. I decided to implement it myself. Thus, dfencoder was born.
The very first feature: EncoderDataFrame. It’s a pandas dataframe with .swap(), a method that returns a copy of the dataframe with swap noise applied (default likelihood of .15).

Example of the .swap() method usage.
From there, the project just sort of snowballed into what it is now: a complete end-to-end framework for building and training DAEs on tabular data.
To learn more about how to use the framework, take a look at the demo notebook.
(Project is still in development but free to use under BSD license.)

Applications of DAE

We’ve discussed DAEs as feature extractors. But what else can they do for us? There are many applications, but just to name a few:

- Feature Imputation
- Anomaly Detection
- Feature Extraction
- Exploratory Analysis (e.g., category embeddings)

You can see how to approach these use cases in the dfencoder demo notebook.
Maybe you or your organization just has a bunch of unused tables of data — if you have some compute resources sitting around unused, why not apply them to learning useful representations of your data?

The dfencoder Philosophy

You have a lot going on: deadlines, ongoing projects on the backburner, and friends and family to boot.
We don’t have time to go down rabbit holes and try to build state-of-the-art representation learning models.
We stick to what works: boosted models, random forests; heck, regularized logistic regression almost always does the trick!

dfencoder deals with all the boilerplate code so you don’t have to. If you want to build a DAE, you’ve got enough to worry about:

- number of hidden layers
- size of hidden layers
- activation function(s)
- swap noise parameter
- learning rate
- optimizer
- regularization
- feature scaling
- and so much more!

I want you to focus on this stuff and actually make a useful DAE, tuning these parameters to optimize whatever it is you’re doing.
I don’t want you to get stuck writing a bunch of code I’ve already written! This is why I created dfencoder.

Check out the library — it’s available for Python 3.6 and uses torch as its deep learning backend.
Reproducing Kaggle-Winning Results

I started this project hoping to reproduce the results that the DAE Kaggle champion got.
I don’t have a GPU, so I used Google Colab for the free GPU backend. Unfortunately, Colab’s system memory (~14 GB) isn’t big enough to tackle the problem the way Michael did (he used 32 GB of memory).
Not to mention, training took him DAYS on his beefy system.
The beginnings of a notebook replicating his procedure can be found here; anyone with the hardware can take a look at the notebook and modify it as necessary to scale up to the winning hyperparameters.
Let me know if you’re able to get it to work!

Good Luck!

I’m curious about your story.
Comment here or mention the library somewhere if you find it interesting or useful! Feedback and feature requests are welcome. :)