Deep Learning Explainability: Hints from Physics

Deep Neural Networks from a Physics Viewpoint

Marco Tavora, Mar 24

Abstract Mandelbrot fractal. Picture by Astonira/Shutterstock.com.

Nowadays, artificial intelligence is present in almost every part of our lives.

Smartphones, social media feeds, recommendation engines, online ad networks, and navigation tools are some examples of AI-based applications that already affect us every day.

Deep learning in areas such as speech recognition, autonomous driving, machine translation, and visual object recognition has been systematically improving the state of the art for a while now.

However, the reasons that make deep neural networks (DNN) so powerful are only heuristically understood.

[...]

The free energy is “the energy in a physical system that can be converted to do work”.

Mathematically, it is given in our case by an expression involving the symbol “tr”, which stands for trace (from linear algebra).

In the present context, it represents the sum over all possible configurations of visible spins v.
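For reference, one standard way to write this free energy is sketched below. It assumes the convention of Mehta and Schwab's paper, in which the temperature is absorbed into the Hamiltonian H(v); the symbols F^v and Z (the partition function) are assumptions here, since they do not appear in the surviving text.

```latex
F^{v} = -\ln Z = -\ln \operatorname{tr}_{v}\, e^{-H(v)}
```

The trace runs over every configuration of the visible spins v, so for n binary spins it is a sum of 2^n terms.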

At each step of the renormalization procedure, the behavior of the system at small length scales is averaged out.

The Hamiltonian of the coarse-grained system is expressed in terms of new coupling constants, and new, coarse-grained variables are obtained.

In our case, the latter are block spins h, and the new Hamiltonian is written in terms of them.

To better understand what block spins are, consider the two-dimensional lattice below.

Each arrow represents a spin.

Now divide the lattice into square blocks each containing 2×2 spins.

The block spins are the average spins corresponding to each of these blocks.

In block spin RG, the system is coarse-grained into new block variables describing the effective behavior of spin blocks (source).
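The 2×2 averaging described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the procedure used in the paper: the function name `block_spins` and the example lattice are hypothetical, and the block spin is taken to be the plain average of the four spins (a majority rule is a common alternative).

```python
import numpy as np

def block_spins(lattice, block=2):
    """Average the spins inside each non-overlapping block x block tile."""
    rows, cols = lattice.shape
    if rows % block or cols % block:
        raise ValueError("lattice sides must be multiples of the block size")
    # Index [i, a, j, b] of the reshaped array is lattice[i*block + a, j*block + b].
    tiles = lattice.reshape(rows // block, block, cols // block, block)
    return tiles.mean(axis=(1, 3))

# Hypothetical 4x4 lattice of +1 / -1 spins.
spins = np.array([
    [ 1,  1, -1, -1],
    [ 1,  1, -1,  1],
    [-1,  1,  1,  1],
    [-1, -1,  1,  1],
])
coarse = block_spins(spins)
print(coarse)  # [[ 1.  -0.5] [-0.5  1. ]]
```

Calling `block_spins` again on the result repeats the coarse-graining step at the next length scale.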

Note that the new Hamiltonian has the same structure as the original one, only with configurations of blocks of spins in place of physical spins.

Both Hamiltonians have the same structure but with different variables and couplings.

In other words, the form of the model does not change, but its parameters change as we zoom out.

The full renormalization of the theory is obtained by systematically repeating these steps.

After several RG iterations, some of the parameters will drop out and some will remain.

The ones that remain are called relevant operators.

A connection between these Hamiltonians is obtained by the requirement that the free energy (described a few lines above) does not change after an RG-transformation.

Variational Renormalization Group (VRG)

As mentioned above, to implement the RG mappings one can use the variational renormalization group (VRG) scheme.

In this scheme, the mappings are implemented by an operator that depends on a set of parameters λ.

This operator encodes the couplings between hidden and input (visible) spins and satisfies a relation that defines the new Hamiltonian given above.

In an exact RG transformation, the coarse-grained system would have exactly the same free energy as the original system, which is equivalent to a condition on this operator.

In practice, this condition cannot be satisfied exactly, and variational schemes are used to find the λ that minimizes the difference between the free energies or, equivalently, that best approximates the exact RG transformation.
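The relations described in this section can be sketched as follows. The notation is taken from Mehta and Schwab's paper and is an assumption relative to the surviving text: T_λ(v, h) is the variational operator, H^{RG}_λ(h) the coarse-grained Hamiltonian, F^v = −ln tr_v e^{−H(v)} the free energy of the visible spins, and F^h_λ that of the block spins.

```latex
\begin{aligned}
e^{-H^{RG}_{\lambda}(h)} &\equiv \operatorname{tr}_{v}\, e^{\,T_{\lambda}(v,h) - H(v)}
  && \text{(definition of the coarse-grained Hamiltonian)} \\
F^{h}_{\lambda} &= -\ln \operatorname{tr}_{h}\, e^{-H^{RG}_{\lambda}(h)}
  && \text{(free energy of the block spins)} \\
F^{h}_{\lambda} &= F^{v}
  && \text{(exact RG transformation)} \\
\operatorname{tr}_{h}\, e^{\,T_{\lambda}(v,h)} &= 1
  && \text{(condition on the operator)} \\
\Delta F &= F^{h}_{\lambda} - F^{v}
  && \text{(quantity minimized over } \lambda \text{ in the variational scheme)}
\end{aligned}
```

Substituting the first line into the second shows why the condition on the operator is enough: if tr_h e^{T_λ(v,h)} = 1, the sum over h drops out and F^h_λ reduces to F^v.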

A Quick Summary of RBMs

I have described in some detail the internal workings of restricted Boltzmann machines in a previous article.

Here I will provide a more condensed explanation.

Neural Quantum States: How neural networks can solve highly complex problems in quantum mechanics (towardsdatascience.com)

Restricted Boltzmann Machines (RBMs) are generative, energy-based models used for nonlinear unsupervised feature learning.

Their simplest version consists of two layers only:

One layer of visible units, which will be denoted by v

One hidden layer, with units denoted by h

Illustration of a simple Restricted Boltzmann Machine (source).

Again, I will consider a binary visible dataset v with n elements extracted from some probability distribution P(v) (Eq. 9: probability distribution of the input or visible data).

The hidden units in the RBM (represented by the vector h) are coupled to the visible units through an interaction energy.

The energy sub-index λ represents the set of variational parameters {c, b, W}, where the first two elements are vectors and the third one is a matrix.

The goal of RBMs is to output a λ-dependent probability distribution that is as close as possible to the distribution of the input data P(v).

The probability associated with a configuration (v, h) and parameters λ is a function of this energy functional.

From this joint probability, one can easily obtain the variational (marginalized) distribution of visible units by summing over the hidden units.
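For a very small RBM, the joint probability and its marginals can be computed by brute-force enumeration. The sketch below rests on several assumptions that are not stated in the surviving text: units take the values ±1, the energy is E(v, h) = c·v + b·h + vᵀWh with p(v, h) proportional to exp(−E) (the sign convention of Mehta and Schwab's paper; other references flip it), c is assigned to the visible layer and b to the hidden layer, and the parameter values are random placeholders.

```python
import itertools
import numpy as np

def energy(v, h, c, b, W):
    # Assumed convention: E(v, h) = c.v + b.h + v^T W h, with p ~ exp(-E).
    return c @ v + b @ h + v @ W @ h

def rbm_distributions(c, b, W):
    """Joint and marginal distributions of a tiny RBM by full enumeration."""
    n_v, n_h = W.shape
    v_states = [np.array(s) for s in itertools.product([-1, 1], repeat=n_v)]
    h_states = [np.array(s) for s in itertools.product([-1, 1], repeat=n_h)]
    weights = np.array([[np.exp(-energy(v, h, c, b, W)) for h in h_states]
                        for v in v_states])
    joint = weights / weights.sum()  # p(v, h) = exp(-E) / Z
    p_v = joint.sum(axis=1)          # marginal of v: sum over hidden units
    p_h = joint.sum(axis=0)          # marginal of h: sum over visible units
    return joint, p_v, p_h

# Placeholder parameters for 3 visible and 2 hidden units.
rng = np.random.default_rng(0)
c, b = rng.normal(size=3), rng.normal(size=2)
W = rng.normal(size=(3, 2))
joint, p_v, p_h = rbm_distributions(c, b, W)
```

Enumeration costs 2^(n_v + n_h) terms, so it only serves as an illustration; practical RBM training relies on sampling instead.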

Likewise, the marginalized distribution of hidden units is obtained by summing over the visible units.

We can also define an RBM Hamiltonian in terms of these marginal distributions.

The λ parameters can be chosen to minimize the so-called Kullback-Leibler (KL) divergence, or relative entropy, which measures how different two probability distributions are.

In the present case, we are interested in the KL divergence between the true data distribution and the variational distribution of the visible units produced by the RBM.
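As a small numerical illustration, the KL divergence between two discrete distributions can be computed directly from its definition, D_KL(p || q) = Σ p(x) ln(p(x) / q(x)). The function name and the two example distributions below are hypothetical stand-ins for the data distribution and the RBM's distribution over visible units.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * ln(p(x) / q(x)) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical distributions over three configurations.
data_dist = np.array([0.5, 0.25, 0.25])
model_dist = np.array([0.4, 0.4, 0.2])

mismatch = kl_divergence(data_dist, model_dist)  # positive: the distributions differ
perfect = kl_divergence(data_dist, data_dist)    # zero: the distributions coincide
```

The divergence is not symmetric in its two arguments, which is why the order (data first, model second) matters.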

When both distributions are identical, this divergence is zero.

Exactly mapping RG and RBM

Mehta and Schwab showed that to establish the exact mapping between RG and RBMs, one can choose a particular expression for the variational operator.

Recall that the Hamiltonian H(v) contains encoded inside it the probability distribution of the input data.

With this choice of variational operator, one can quickly prove that the RG Hamiltonian and the RBM Hamiltonian on the hidden layer are the same.

Also, when an exact RG transformation can be implemented, the true and variational Hamiltonians are identical.

Hence we see that one step of the renormalization group with spins v and block spins h can be exactly mapped onto a two-layered RBM made of visible units v and hidden units h.
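The argument can be sketched in a few lines. The notation follows Mehta and Schwab's paper and is an assumption relative to the surviving text: E(v, h) is the RBM energy, p_λ(v, h) = e^{−E(v,h)}/Z the joint distribution, the RBM Hamiltonians are defined through the marginals p_λ(h) = e^{−H^{RBM}_λ(h)}/Z and p_λ(v) = e^{−H^{RBM}_λ(v)}/Z, and the coarse-grained Hamiltonian through e^{−H^{RG}_λ(h)} = tr_v e^{T_λ(v,h) − H(v)}.

```latex
\begin{aligned}
T_{\lambda}(v,h) &= -E(v,h) + H(v) \\[4pt]
\frac{e^{-H^{RG}_{\lambda}(h)}}{Z}
  &= \frac{\operatorname{tr}_{v}\, e^{\,T_{\lambda}(v,h) - H(v)}}{Z}
   = \frac{\operatorname{tr}_{v}\, e^{-E(v,h)}}{Z}
   = p_{\lambda}(h)
   = \frac{e^{-H^{RBM}_{\lambda}(h)}}{Z}
  \;\Longrightarrow\; H^{RG}_{\lambda}(h) = H^{RBM}_{\lambda}(h) \\[4pt]
\operatorname{tr}_{h}\, e^{\,T_{\lambda}(v,h)}
  &= e^{H(v)} \operatorname{tr}_{h}\, e^{-E(v,h)}
   = e^{\,H(v) - H^{RBM}_{\lambda}(v)}
\end{aligned}
```

The last line equals 1 precisely when H(v) = H^{RBM}_λ(v), which is the statement that the true and variational Hamiltonians coincide for an exact RG transformation.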

As we stack increasingly more layers of RBMs we are in effect performing more and more rounds of the RG transformation.

Application to the Ising Model

Following this rationale, we conclude that RBMs, a type of unsupervised deep learning algorithm, implement the variational RG process.

This is a remarkable correspondence and Mehta and Schwab demonstrate their idea by implementing stacked RBMs on a well-understood Ising spin model.

They fed, as input data, spin configurations sampled from an Ising model into the DNN.
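Spin configurations of this kind can be generated with a standard Metropolis sampler. The sketch below is only an illustration of the idea: the surviving text does not say how the training configurations were produced, so the nearest-neighbour Hamiltonian H = −Σ s_i s_j with periodic boundaries, the function name `sample_ising`, and all parameter values (lattice size, inverse temperature `beta`, number of sweeps) are assumptions.

```python
import numpy as np

def sample_ising(size=8, beta=0.4, sweeps=200, seed=0):
    """Return one size x size configuration of +1 / -1 spins after Metropolis sweeps."""
    rng = np.random.default_rng(seed)
    spins = rng.choice([-1, 1], size=(size, size))
    for _ in range(sweeps * size * size):
        i, j = rng.integers(0, size, size=2)
        # Sum of the four nearest neighbours with periodic boundaries.
        neighbours = (
            spins[(i + 1) % size, j] + spins[(i - 1) % size, j]
            + spins[i, (j + 1) % size] + spins[i, (j - 1) % size]
        )
        # Energy change of flipping spin (i, j) for H = -sum_<ij> s_i s_j.
        delta_e = 2 * spins[i, j] * neighbours
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if delta_e <= 0 or rng.random() < np.exp(-beta * delta_e):
            spins[i, j] *= -1
    return spins

config = sample_ising()
```

Each call with a different `seed` gives another configuration; flattening the resulting arrays yields visible vectors v that could serve as training data for an RBM.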

Their results show that, remarkably, DNNs seem to be performing (Kadanoff) block spin renormalization.

In the authors’ words: “Surprisingly, this local block spin structure emerges from the training process, suggesting the DNN is self-organizing to implement block spin renormalization… It was astounding to us that you don’t put that in by hand, and it learns”.


In the figure below from their paper, A shows the architecture of the DNN.

In B the learning parameters W are plotted to show the interaction between hidden and visible units.

In D we see the gradual formation of block spins (the blob in the picture) as we move along the layers of the DNN.

In E the RBM reconstructions reproducing the macroscopic structure of three data samples are shown.

Deep neural networks applied to the 2D Ising model.

See the main text for a detailed description of each of the figures (source).

Conclusions and Outlook

In 2014 it was shown by Mehta and Schwab that a Restricted Boltzmann Machine (RBM), a type of neural network, is connected to the renormalization group, a concept originally from physics.

In the present article, I reviewed part of their analysis.

As previously recognized, both RG and deep neural networks bear a remarkable “philosophical resemblance”: both distill complex systems into their relevant parts.

This RG-RBM mapping is a kind of formalization of this similarity.

Since deep learning and biological learning processes have many similarities, it is not too much of a stretch to hypothesize that our brains may also use some kind of “renormalization on steroids” to make sense of our perceived reality.

As one of the authors suggested, “Maybe there is some universal logic to how you can pick out relevant features from data, I would say this is a hint that maybe something like that exists.”

The problem with this hypothesis is that, in contrast to self-similar systems (with fractal-like behavior) where RG works well, systems in nature are generally not self-similar.

A possible way out of this limitation, as pointed out by the neuroscientist Terrence Sejnowski, would be if our brains somehow operated at critical points with all neurons influencing the whole network.

But that is a topic for another article!

Thanks for reading and see you soon! As always, constructive criticism and feedback are welcome!

My Github and personal website www.marcotavora.me have (hopefully) some other interesting stuff both about data science and about physics.