Lucky for us humans, there is a rather elegant way to mathematically describe how ‘close’ together different data-points are, so that with well-chosen features, examples belonging to the concept will group together and be distinguishable from non-examples in a numerical way that a machine can understand.

This description of ‘closeness’ requires that we represent our data in a feature space.

Concepts in Feature Space

Given a set of features for a concept learning problem, we can interpret the feature set as a feature space.

Given some data, a feature space is just the set of all possible values for a chosen set of features from that data.

It is always possible to represent feature values and thus a feature space using only numbers, and further to do so in such a way that the feature space can be interpreted as a real space.
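As a quick sketch of that claim (the colour feature and encoding here are invented for illustration), even a categorical feature can be turned into numbers, for instance with a one-hot encoding:

```python
def one_hot(value, categories):
    """Encode a categorical feature value as a numeric vector:
    1.0 in the slot of the matching category, 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]

# A hypothetical categorical feature 'colour' becomes 3 numeric features
colours = ["red", "green", "blue"]
print(one_hot("green", colours))  # [0.0, 1.0, 0.0]
```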

What do we mean by real space? A mathematician would say that our n-dimensional feature space can be shown to be isomorphic to the vector space ℝⁿ, but what they would mean is that we can move and dance around in our feature space, and that there are n dimensions in which to bust our metaphorical moves.

More dimensions? More room for activities.

Dancing aside, what’s really important is that we can represent our data as a feature vector giving coordinates in n-dimensional space, where n is (typically) the number of features.

As an example, suppose we have data with features height (m), width (m) and weight (kg).

We can write this as a tuple (height, width, weight) and we can consider this as a 3-dimensional space:

3D feature space for (height, width, weight), populated with some examples.

One of the many reasons that this spatial representation of our data is useful is because it allows us to introduce an idea of ‘distance’ and therefore ‘closeness’ into our data.
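To make that idea of 'distance' concrete, here is a minimal sketch using the familiar Euclidean distance, with made-up (height, width, weight) values:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical feature vectors: (height m, width m, weight kg)
dog = (0.6, 0.3, 20.0)
cat = (0.25, 0.15, 4.0)
house = (8.0, 10.0, 50000.0)

print(euclidean_distance(dog, cat))    # the cat lands close to the dog...
print(euclidean_distance(dog, house))  # ...while the house is very far away
```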

For example, if we are trying to learn the ‘concept of dog’, it seems natural that a cat should somehow be closer to a dog than, say, a house.

It’s safe to say that people don’t often mistake houses for dogs, but there may be some fairly dog-like cats out there.

Left: Atchoum, the dog-cat.

Right: The doggiest house I could find.

Given some labelled data on dogs and non-dogs, we can choose some appropriate features, and plot the data in our feature space.

If we see a clean clustering of data-points that represent dogs in our feature space, then the features in our data must be characterising dogs pretty well.

If our features are doing a good job of describing dogs, we'd expect any data-points representing cats to sit much closer to the dog cluster than any data-points representing houses.

In this ideal scenario where our data has good features, the spatial distance in our feature space is analogous to the conceptual distance for the target concept we are trying to learn.
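One crude way to exploit this correspondence (a sketch with invented 2-feature data, not the post's actual method) is to compare distances to the centroid of the known dog examples:

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented 2D feature vectors for labelled dog examples
dogs = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.9)]
dog_centre = centroid(dogs)

cat = (0.8, 0.7)
house = (9.0, 12.0)

# With good features, a cat falls nearer the dog cluster than a house does
print(distance(cat, dog_centre) < distance(house, dog_centre))  # True
```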

Below we can see three possible feature spaces of differing quality for a concept learning problem with 2 features:

Examples of different 2-dimensional feature spaces for a concept learning problem.

The left-most feature space in the above image depicts an idealised feature space for concept learning.

Red data-points, which belong to the target concept, are cleanly separable from the blue data-points, which do not.

Here, spatial distance corresponds to conceptual distance, which is exactly what we want.

This means there exists a perfect characteristic function based on these 2 features, and this target concept is therefore learnable.

The middle image depicts a more realistic feature space for concept learning.

Spatial distance approximately corresponds to conceptual distance. Because the separation is imperfect, we can't define a perfect characteristic function based on these features, but we can learn a useful approximation to one with small error.

In the rightmost example of a poor feature space, examples and non-examples are thoroughly mixed, so we can't hope to approximate a characteristic function with any reasonably small error.

In this case, the chosen features of our data do a poor job of associating spatial distance with conceptual distance.

If we have a relatively good feature space, such as in the first two examples above, our classification problem becomes one of how best to divide up the feature space in order to capture the concept with minimal error. This is a purely numerical problem, which is the appropriate domain for machines.

This is the essential marriage between feature spaces and concept learning.

Drawing Decision Boundaries

If our intuition tells us that our n-dimensional feature space contains underlying information that represents the concept, then getting a machine to learn the concept from these features is as simple as giving it some training data and asking it to draw a boundary in the space that separates examples from non-examples with minimal error.

Then, to classify new and previously unseen examples, we can simply calculate which side of the decision boundary they’re on.

One possible decision boundary for the given 2D concept learning problem.
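That side-of-the-boundary check can be sketched as follows. The linear boundary w·x + b = 0 here is an invented, already-learned example; real models may produce far more complicated boundaries:

```python
def classify(point, w, b):
    """Return True if the point falls on the 'example' side of the
    linear decision boundary defined by weights w and bias b."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return score > 0

# Assumed pre-learned boundary: x1 + x2 - 1 = 0
w, b = (1.0, 1.0), -1.0
print(classify((0.9, 0.8), w, b))  # True: classified as an example
print(classify((0.1, 0.2), w, b))  # False: classified as a non-example
```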

Every learning model for classification you have ever heard of (Support Vector Machines, Neural Networks, Naive Bayes Classifiers, K Nearest Neighbours, Genetic Algorithms, and so on) relies, explicitly or implicitly, on this idea of feature space. They are all simply different methods for finding good decision boundaries, with different advantages and disadvantages depending on the nature of the underlying feature space.

If you are new to machine learning, you might ask: why do we need to get the machine to solve these problems at all? Why can't we just visualise our feature spaces and draw our own decision boundaries?

The answer is two-fold. Firstly, we're not looking to 'draw' a decision boundary at all — we're looking for a function which takes an example's feature values as arguments and tells us whether or not the example belongs to the concept.

While it’s true that this function can be drawn in feature space, the visual elements of this theory are just there to aid your understanding as a human.

The machine isn’t visualising feature spaces — it is performing purely numerical operations.

To find a function corresponding to a decision boundary by hand, you would have to perform these same numerical calculations yourself, and it is best to leave that to a machine.

Secondly, most concepts and ideas are complex, requiring high-dimensional data and feature spaces to capture.

Patterns in such data are rarely obvious to humans, with our 3-dimensional intuition.

Machines don’t see data like we do — their senses are numerical, and their decisions are binary.

Human intuition is necessary to make a concept learning problem as easy as possible for a machine but finding the pattern is the machine’s job.

Where does dog end and cat begin?

Learning to recognise a concept in an n × n pixel image is an n²-dimensional learning problem if we use each pixel intensity as a feature.
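For instance (a sketch using a made-up 2 × 2 'image'), treating each pixel intensity as a feature just means flattening the grid into one long feature vector:

```python
def image_to_features(image):
    """Flatten an n-by-n grid of pixel intensities into an
    n*n-dimensional feature vector."""
    return [pixel for row in image for pixel in row]

img = [[0, 255],
       [128, 64]]              # a 2x2 image...
print(image_to_features(img))  # ...becomes a point in 4-dimensional space
```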

Will a robot ever write a symphony?

Surprisingly, it turns out a robot already has.

AIVA is the world’s first artificially intelligent composer recognised by a music society (SACEM).

AIVA stands for Artificially Intelligent Virtual Artist — its primary method of learning uses a deep neural network trained on a large number of copyright-free classical scores.

AIVA’s music has been used commercially, and the team behind it are continuing to work on musical composition as well as explore other avenues for AIVA to express its creativity.

Suck it, Spooner!

AIVA is solving much more than a simple concept learning problem, but the ideas presented in this post are really at the heart of all machine learning, and will allow us to talk about more complex ideas in future blog posts.

AIVA doesn’t do all of its composition on its own; it requires human input for higher-level elements such as orchestration, so perhaps in a sense Detective Spooner was right, but it seems that we are not too far away from something like an independent machine composer.

You can listen to AIVA’s music here, as well as read an in-depth article about AIVA here.

That’s a wrap — thanks for reading!

This blog post is a precursor to my next, which will be a little more technical and focused on the kernel trick: an elegant mathematical method that allows us to improve separation and gain more information about our data, without increasing the complexity of our feature space.

Stay tuned! If you enjoyed this introduction to concept learning and feature spaces, feel free to get in touch with me (Tom Grigg) regarding any thoughts, queries or suggestions for future blog posts!