Learning Theory: (Agnostic) Probably Approximately Correct Learning

The definition states that a hypothesis class H is PAC learnable if there exist a function m_H and a learning algorithm with the following property: for every labeling function f, every distribution D over the domain of inputs X, and every epsilon and delta, running the algorithm on m ≥ m_H(epsilon, delta) samples produces a hypothesis h whose true error is at most epsilon, with probability at least 1 - delta.
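In symbols, this is often written as follows (a sketch in the notation of the Shalev-Shwartz and Ben-David book, where L_{D,f}(h) denotes the true error of h under distribution D with labels given by f):

```latex
% H is PAC learnable if there exist m_H : (0,1)^2 \to \mathbb{N} and a
% learner A such that for every \epsilon, \delta \in (0,1), every
% distribution D over X, and every labeling function f (realizable by H),
% running A on m \ge m_H(\epsilon, \delta) i.i.d. samples returns h with
\Pr\left[ L_{D,f}(h) \le \epsilon \right] \ge 1 - \delta
```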

A labeling function is nothing more than a fixed function f that assigns a label to every point in the domain.

Here, the hypothesis class can consist of any type of binary classifier, since the labeling function that assigns labels to examples from the domain outputs either 0 or 1.
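To make this concrete, here is a minimal sketch of Empirical Risk Minimization over a finite class of binary classifiers. The class of 1D threshold functions, the toy data, and all names are my own illustrative choices, not from the article:

```python
import numpy as np

def erm_threshold(xs, ys, hypotheses):
    """ERM over a finite class of threshold classifiers
    h_t(x) = 1 if x >= t else 0: return the hypothesis
    with the lowest empirical (training) error."""
    best_t, best_err = None, float("inf")
    for t in hypotheses:
        preds = (xs >= t).astype(int)
        err = np.mean(preds != ys)  # empirical error on the sample
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# toy sample labeled by a "true" threshold at 0.5 (realizable case)
xs = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
ys = (xs >= 0.5).astype(int)
t, err = erm_threshold(xs, ys, hypotheses=[0.0, 0.25, 0.5, 0.75, 1.0])
```

Because the sample is realizable by the class, ERM here recovers a hypothesis with zero empirical error.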

The function m_H gives us a bound on the minimal number of samples we need in order to achieve an error lower than epsilon with confidence 1 - delta.
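For a finite hypothesis class under the realizability assumption, a standard bound of this shape exists (a sketch of the usual statement for finite classes, not the tightest possible one):

```latex
m_H(\epsilon, \delta) \le \left\lceil \frac{\log\left(|H| / \delta\right)}{\epsilon} \right\rceil
```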

The accuracy epsilon naturally controls the necessary sample size: the higher the accuracy we demand, the more faithfully our training set has to represent the domain, and therefore the more samples we need.
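As a quick numerical illustration, assuming the standard finite-class, realizable-case bound m ≥ log(|H|/delta)/epsilon (the function name is my own), the required sample size grows as epsilon shrinks:

```python
from math import ceil, log

def sample_bound(h_size, epsilon, delta):
    """Sufficient sample size for a finite hypothesis class of size
    h_size in the realizable case: m >= log(|H|/delta) / epsilon."""
    return ceil(log(h_size / delta) / epsilon)

# demanding 10x higher accuracy costs roughly 10x more samples
for eps in (0.1, 0.01, 0.001):
    print(f"epsilon={eps}: m >= {sample_bound(1000, eps, 0.05)}")
```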

Making the Model Agnostic

The model above has a certain drawback: it is not general enough, because of the realizability assumption (explained in Empirical Risk Minimization). Nobody guarantees that our current hypothesis space contains a hypothesis with a true error of 0; the model may simply fail to capture the data.

Another way of looking at it is that the labels may not be fully determined by the data, because some relevant features are missing.

The way we circumvent the realizability assumption is by replacing the labeling function with a joint distribution over data and labels.

You can look at this as introducing uncertainty into the labeling: one and the same data point can now carry different labels with different probabilities.

So why is it called Agnostic PAC learning? Well, the word agnostic comes from the fact that the learning is agnostic towards the data-labels distribution: the learner tries to find the best labeling function f while making no assumptions about that distribution.

What changes in this case? Well, the definition of the true error changes, since the label of a data point is now a distribution over the possible labels.
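Concretely, with D now a joint distribution over data-label pairs, the true error is commonly written as:

```latex
L_D(h) = \Pr_{(x, y) \sim D}\left[ h(x) \neq y \right]
```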

We cannot guarantee that the learner achieves the minimal possible true error: the data-labels distribution is unknown to us, and because of the label uncertainty even the best hypothesis in the class may have a nonzero error.

After those considerations, we arrive at the formal definition from the book. Notice the changes with regard to the definition of PAC learnability.
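In standard notation, the agnostic PAC requirement reads (a sketch following the book's formulation): with probability at least 1 - delta over a draw of m ≥ m_H(epsilon, delta) i.i.d. samples from D, the returned hypothesis h satisfies

```latex
L_D(h) \le \min_{h' \in H} L_D(h') + \epsilon
```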

By introducing the data-labels distribution D, we only require that the true error of the learned hypothesis is less than or equal to the error of the best hypothesis in the class plus epsilon.

This also encapsulates PAC learning itself: if the hypothesis space happens to contain a hypothesis with a true error of 0, the definition reduces to the earlier one, but we also allow for the possibility that no such hypothesis exists.

These definitions will be useful later on for explaining the VC-dimension and proving the No-Free-Lunch theorem.

If the terminology was a bit foreign to you, I advise you to take a look at Learning Theory: Empirical Risk Minimization, or to have a more detailed look at the brilliant book by Ben-David mentioned in the article.

Other than that, keep machine learning!