It seems like a contradiction, and in one sense, it is.
We do lose information, but if we’re clever, we can minimise this.
To understand how we can achieve this, it helps to visualise it in only two dimensions.
Imagine you have a series of dots on a two-dimensional chart, like this:

The position of each of these dots can be written as two coordinates: two columns of data.
Now imagine that we draw a straight line next to these dots, and “project” their positions onto the line, like this:

The location of each dot on the line can be written as just one coordinate, its position along the line.
We’ve lost some information about the original dots, but, as long as we’ve drawn the line at the right angle to the dots, we’re preserving something about their relative positions to each other.
“Dimensionality reduction” is a set of mathematical techniques for choosing the right angle for that line, so that we preserve as much information as possible.
This is easy to visualise in two dimensions, but it also works in multiple dimensions.
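To make the idea of "the right angle" concrete, here is a minimal sketch in Python. The dots, the function names and the two angles are all made up for illustration; it simply projects the same dots onto two different lines and compares how much of their spread survives.

```python
import numpy as np

# Made-up dots, scattered roughly along a rising diagonal.
points = np.array([[1.0, 1.2], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.1]])

def project_onto_line(points, angle_degrees):
    """Return each dot's single coordinate along a line at the given angle.

    The line passes through the centre of the dots.
    """
    theta = np.radians(angle_degrees)
    direction = np.array([np.cos(theta), np.sin(theta)])  # unit vector
    centred = points - points.mean(axis=0)
    return centred @ direction

def spread(positions):
    """Variance of the one-dimensional positions."""
    return float(np.var(positions))

along = project_onto_line(points, 45)    # line runs the same way as the dots
across = project_onto_line(points, 135)  # line runs across the dots

print("spread along the dots: ", spread(along))
print("spread across the dots:", spread(across))
```

With the line running along the dots, they stay spread out and keep their original order. With the line running across them, they collapse into a small clump and most of the information about their relative positions is gone.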
In our case, we’re reducing the number of dimensions (columns of data) from thousands to just a hundred.
The choice of a hundred columns is slightly arbitrary.
Reducing the number of columns reduces “noise” in the data — arbitrary and useless information.
But if we reduce the number of columns too much, we lose useful information as well.
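Here is a rough sketch of that trade-off. This section doesn't say which reduction technique was used, so the code below assumes a plain singular value decomposition from NumPy, and the table of films is random stand-in data rather than the real dataset. The helper names `reduce_columns` and `retained` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in for a film-by-keyword table: 200 films, 1000 columns,
# with a 1 wherever a film "has" a keyword. Not the real data.
films = (rng.random((200, 1000)) < 0.05).astype(float)

# Singular value decomposition; s holds the singular values, largest first.
u, s, vt = np.linalg.svd(films, full_matrices=False)

def reduce_columns(k):
    """Keep only the k strongest directions, giving a (films x k) table."""
    return u[:, :k] * s[:k]

def retained(k):
    """Fraction of the table's total squared magnitude kept by k columns."""
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

for k in (2, 100):
    print(k, "columns:", reduce_columns(k).shape, "retained:", round(retained(k), 3))
```

The fewer columns we keep, the smaller the retained fraction. On random data like this the numbers mean very little; on a real dataset, the interesting question is where the curve flattens out.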
Here’s our films reduced to just two columns of information:

Table 4: Films reduced to two dimensions

The two remaining columns are impossible to interpret: so much information has been lost that they are effectively meaningless.
We need to find a balance, and it turns out that in this case, for this dataset, a hundred columns is about right.
Here’s a small sample of what that looks like:

Table 5: Films as “vectors”

The meaning of each column is even more difficult to interpret than the single column we had previously, but collectively, they encode some quite sophisticated information about the films in our dataset.
We can test this by looking at which films look similar, according to the data in this table.
If we’ve done a good job, similar films in our table should be very similar films in real life.
For example, similar films to “The Hangover”, according to this data, include the similarly dire “The 40 Year Old Virgin”, “The Hangover Part II”, and “Knocked Up”.
It looks like we’ve done pretty well — all of these films are from a similar vein of cynical, misogynistic comedy.
They share similar subject material and a similar style of comedy.
The closest film to Ben Affleck’s career highlight, 1998’s Armageddon, is “Deep Impact”, which makes sense since it’s basically exactly the same film only with (tragically) less Aerosmith.
We can do some other fun things with this data set, but to explain them, I’ll have to digress into some details about similarity measures.
Deep breath everyone.
In the last chapter, I showed how “euclidean distance” could be used to measure the similarity of two rows of numeric data.
By plotting them as points in multi-dimensional space, and drawing a line between the two points, you can measure the similarity of the two rows.
That approach works for this dataset as well.
But there’s another approach.
Instead of measuring the distance between two points, we can measure the angle between them.
For reasons that make sense to people who know more about maths than I do, this is called “cosine similarity”.
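The name comes from the way the angle is calculated: the cosine of the angle between two rows is their dot product divided by the product of their lengths. A minimal sketch of both measures, with function names of my own choosing and two made-up rows:

```python
import numpy as np

def euclidean_distance(a, b):
    """Length of the straight line between two points."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Cosine of the angle between two rows: 1 means same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def angle_degrees(a, b):
    """The angle itself, recovered from the cosine."""
    cosine = np.clip(cosine_similarity(a, b), -1.0, 1.0)  # guard against rounding
    return float(np.degrees(np.arccos(cosine)))

# Two made-up rows with two columns each.
row_a = [1.0, 0.0]
row_b = [1.0, 1.0]

print("distance:", euclidean_distance(row_a, row_b))
print("cosine similarity:", cosine_similarity(row_a, row_b))
print("angle in degrees:", angle_degrees(row_a, row_b))
```

A cosine similarity of 1 means the two rows point in exactly the same direction (an angle of zero), and 0 means they are at right angles.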
For example, here are two classic nautical-themed films, “Finding Nemo” and “Jaws”, plotted by their similarity to 1990’s thoughtful independent film “Teenage Mutant Ninja Turtles”, and John Carpenter’s low-budget action classic “Escape From New York”.
The angle between the two points, around 40 degrees, indicates that the films are fairly dissimilar, at least on the “Foam Turtle/Snake Plissken” scale provided by these two axes.
For most purposes, cosine similarity and euclidean distance give very similar results.
But there is one property of cosine similarity that is very different.
Cosine similarity remains the same regardless of the magnitude of the two lines.
In other words, the total amount of each value doesn’t matter, only the ratio of one value to the other.
This is useful because it means we can add the values of two films together, and still measure similarity effectively.
That’s a bit esoteric sounding, so let’s investigate that further.
“The Thing”, another classic John Carpenter film, is even further from Finding Nemo than is Jaws.
If we add The Thing and Finding Nemo’s values together, we can create a hypothetical new film “The Thing About Nemo” (for obvious reasons, we’ll avoid calling it “Finding Nemo’s Thing”).
“The Thing About Nemo”, because it combines the values of two other films, sits much further along both axes than any of the other films.
It is literally the sum of the two films — Finding Nemo has around 8% of its keywords in common with Teenage Mutant Ninja Turtles, and The Thing has around 3%.
That gives “The Thing About Nemo” a score of around 11% on this axis (8 + 3 = 11).
This means that our new film is quite distant from the other films on the chart: it has much higher values on both axes.
But because we’re using cosine similarity, and measuring the angle, not the distance, we find that “The Thing About Nemo”, our combination of two films, is in fact very similar to “Jaws”.
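A small sketch shows the effect. Only the 8% and 3% figures come from the chart; every other coordinate below is invented so that the picture roughly matches the description, and the function names are my own.

```python
import numpy as np

def angle_degrees(a, b):
    """Angle between two rows, in degrees."""
    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))

def distance(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Columns: (similarity to Teenage Mutant Ninja Turtles,
#           similarity to Escape From New York), in percent.
# The second column, and both Jaws values, are invented for illustration.
nemo = np.array([8.0, 1.0])
thing = np.array([3.0, 9.0])
jaws = np.array([5.0, 5.0])

thing_about_nemo = nemo + thing  # simply add the two rows together

print("angle, Nemo vs Jaws:            ", angle_degrees(nemo, jaws))
print("angle, Nemo vs The Thing:       ", angle_degrees(nemo, thing))
print("angle, combined film vs Jaws:   ", angle_degrees(thing_about_nemo, jaws))
print("distance, Nemo vs Jaws:         ", distance(nemo, jaws))
print("distance, combined film vs Jaws:", distance(thing_about_nemo, jaws))
```

With these made-up numbers, the combined film is further from Jaws than either original by distance, yet the angle between it and Jaws is only a few degrees.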
Let’s review what’s happened here.
We’ve combined a film about aquatic adventure, with a film about being stalked by an alien monster, and the result is very similar to Jaws, a film about being stalked by an aquatic monster.
Is this a coincidence? Is this just an accident of maths, or do the numeric values we’ve derived actually encode some useful fragment of meaning about the films?

Above, I promised that we’d build something that actually understood films.
Have we succeeded in doing that?

Never stop swimming

When using just two axes, as we have above, it turns out that this approach returns nonsensical results more often than not.
But when we use the full 100-column data set we created earlier, the results become eerily insightful.
Just as when we used two axes, we take the sum of two films, adding the values in each of the 100 columns.
Then, using cosine similarity, we find the film with the smallest angle — the least difference — to the summed row.
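In code, the whole trick is only a few lines. The sketch below uses four hypothetical films with four made-up columns each, standing in for the real 100-column table. The `most_similar` helper and the decision to exclude the two input films from the search are my own assumptions, not necessarily how the real code works.

```python
import numpy as np

def most_similar(query, vectors, exclude=()):
    """Return the title whose row has the smallest angle to `query`."""
    query = query / np.linalg.norm(query)
    best_title, best_score = None, -np.inf
    for title, row in vectors.items():
        if title in exclude:
            continue
        score = float(query @ row / np.linalg.norm(row))  # cosine similarity
        if score > best_score:
            best_title, best_score = title, score
    return best_title

# Hypothetical films with made-up values; the real rows have 100 columns.
vectors = {
    "Film A": np.array([1.0, 0.0, 0.2, 0.0]),
    "Film B": np.array([0.0, 1.0, 0.0, 0.3]),
    "Film C": np.array([0.9, 1.1, 0.1, 0.2]),
    "Film D": np.array([0.0, 0.1, 1.0, 0.9]),
}

# Add the two rows column by column, then look for the nearest other film.
combined = vectors["Film A"] + vectors["Film B"]
print(most_similar(combined, vectors, exclude={"Film A", "Film B"}))
```

Excluding the two input films matters: otherwise one of them will often turn out to be the closest match to its own sum.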
Using this approach, Kathryn Bigelow’s best film, 1991’s “Point Break”, added to Disney’s “Cars”, yields a film that, by cosine similarity, is most similar to 2001’s “The Fast and the Furious”.
This is spookily accurate.
“Point Break” and “The Fast and the Furious” are both about a maverick police officer drawn into a seductive subculture in order to apprehend a master criminal, with whom he has an intense homoerotic relationship.
The only differences between them are Swayze’s sparkling eyes and that, while Point Break focuses on surfing culture, the Fast and the Furious fetishises cars.
Try this for yourself, at moviemaths.com

By the same process, we can try adding the revolutionary horror film, 1979’s Alien, to 2004’s high-school comedy, Mean Girls.
The resulting values are, according to our cosine similarity calculation, very similar to the Stephen King adaptation, “Carrie”, a horror film about mean girls in a high school.
Perhaps in an attempt to reconcile the interests of a diverse group of friends, we might try to find a film combining the qualities of Schindler’s List and The Hangover.
The algorithm informs us that our best bet is the eminently forgettable WWII action film “Enemy at the Gates” (2001).
This approach also works with subtraction.
If we take the disturbingly racist “Indiana Jones and the Temple of Doom”, and subtract its slightly-less-racist predecessor, “Raiders of the Lost Ark”, the resulting film is most similar to 1946’s “Tarzan and the Leopard Woman”, a film which, to put it mildly, does little to contribute to positive race-relations.
What’s happening here? These combinations don’t always make complete sense, but there’s a hint of something meaningful behind them.
What we have, at the end of this process, is an encoding of data about a film which seems to contain something of the meaning of the film.
The algorithm is still just numbers and maths, but the results, such as our movie-combining tricks, feel like the product of something that actually knows something about films.
What we’ve seen feels tantalisingly close to that vision of a computer that actually understands us.
The way we transformed the data allows the computer to generate unique insights about the movies in our dataset.
It can give us a novel perspective on them.
Have we succeeded in making something that actually understands films? It’s easy to feel like the algorithm has generated some kind of understanding of the information it has been given.
Techniques like those we’ve used above are increasingly used in all kinds of domains to parse and filter huge amounts of information.
Law offices are being revolutionised by algorithms which can scan millions of documents to find relevant case law.
Doctors can call upon automated processes to find recent relevant literature to help treat a patient’s illness.
We can even produce coherent and useful summaries of these documents, all at the press of a button.
Computers can, at least within certain domains, read and understand text.
But it also seems like there’s a vast gap between this slightly uncanny trick, and real human insights.
The computer might get an answer that makes sense, but does it really know what that answer means? It feels like we have constructed an elaborate magician’s illusion: the rabbit comes out of the hat, and if we forget about the hat’s false bottom and the magician’s dexterity, we can almost believe that it’s magic.
Is there a difference between real intelligence, and this simulation of it? That question is something we’ll return to as we delve into even more sophisticated algorithms.
For now, it feels like we’ve created an interesting toy, but it still falls far short of “true” intelligence.
In coming essays, that line will become increasingly blurred.
We’re going to need a bigger boat.
The previous essay in this series, “Knitting and Recommendations” can be found here.
The next essay will be published in May.
To experiment with movie maths yourself, go to moviemaths.
All the code for this essay can be found at my github, here.