There are a number of ways I can think of to do this, but the easiest for me was to just build a single class object detector to recognize faces in images, pass that face into a pretrained network for feature extraction, then pass that feature vector into annoy.
When annoy pulls the similar images I can just have it return the base images so basically it finds which faces in its database are most similar to the new extracted face and returns the images those faces appeared in.
Facial Recognition Based Similarity

For this project I built a simple single-class object detector to recognize anime character faces in images.
I used Labelimg to quickly annotate the images and I am pretty sure it only took me around 20 minutes to label the 400 images for my test and training splits.
I have done quite a few of these at this point and it only being a single class speeds up the process significantly.
When training the detector I used a Faster-RCNN-Inceptionv2 model with the coco weights from the Tensorflow model zoo and trained the model for around 3 hours.
I trained from around midnight on Friday until 3am on Saturday, which has thrown a bit of a wrench in my sleep schedule since I was up working on some other stuff then.
The object detector trained up fairly quickly and the output looks quite clean.
This is heartening since it will be the key piece in this pipeline to find more character specific similar images.
While using the object detector to crop the heads out of the original dataset, I saved a csv mapping of the heads to their original images.
My idea was that I could run a feature extractor on each headshot, store the result in the annoy model, and when the time comes, match the annoy output back to the original image.
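The crop-and-log step can be sketched roughly like this. It is a minimal sketch, not the post's actual code: the function names, the CSV columns (head file, source image), and the normalized (ymin, xmin, ymax, xmax) box format (the TensorFlow Object Detection API convention) are my assumptions.

```python
import csv
import os
from PIL import Image

def to_pixel_box(box, width, height):
    """Convert a normalized (ymin, xmin, ymax, xmax) detector box to the
    pixel (left, upper, right, lower) tuple that PIL's crop() expects."""
    ymin, xmin, ymax, xmax = box
    return (int(xmin * width), int(ymin * height),
            int(xmax * width), int(ymax * height))

def crop_and_log(image_path, boxes, out_dir, csv_writer, start_idx=0):
    """Crop each detected head out of image_path, save it, and write a CSV
    row mapping the head crop back to its source image."""
    img = Image.open(image_path)
    for i, box in enumerate(boxes, start=start_idx):
        head_name = f"head_{i:05d}.jpg"
        img.crop(to_pixel_box(box, *img.size)).save(
            os.path.join(out_dir, head_name))
        csv_writer.writerow([head_name, os.path.basename(image_path)])
    return start_idx + len(boxes)
```

With that mapping on disk, any annoy hit on a head crop can be traced back to the full image it was cut from.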
Feature Extraction with Pytorch and Annoy

Now that I can extract heads from images, all I had to do was pass those heads through a feature extractor (once again a ResNet101), then pass those feature vectors to annoy.
As a demo here is one of the images from before where the raw image model had some issues.
This is an example output of the object detector detecting two faces in the image.
So each image will have features extracted from it and then matched against the larger database with annoy.
The first output is from the character on the left of the main image (who does appear in the dataset) and the first two and final images of the 4 similar ones are of that character.
This is an improvement over the raw image input for this image, which had zero matches.
The second character (the one on the right) doesn't actually appear in the database… but she is basically identical in facial features to the one on the left, so two of the four images match (the 1st and 4th similar). This appears to be a good improvement over just using the base image, since the goal is to return similar characters.
Now let's check out the other example image I used before with the base model.
So in this one, rather than getting a bunch of just red-and-black images, the results seem a little more tailored.
While the first, second, and fourth similar images are of a different character the 3rd is of the same character.
This time all of the results are at least the same gender as the base image; the previous version paired all male characters with a base image of a female character.
While this is not a great result it seems to be an improvement over the previous version.
Closing Thoughts

After looking through the output of these two pipelines I felt that these results were acceptable, but not great.
Using the base images returned images that have similar feels but not necessarily similar characters.
While the face detector helped to focus the outputs to be of similar characters, it often did not return images of an overall similar style.
While 2 of the 4 returned images are not of the same character I do actually like the 2nd result in the middle because it has a similar “feel” to the base image.
As I mentioned before, the headshot based model focuses well on the characters in question.
In this case all characters are the same.
However, it doesn't match the feel of the original image.
What I really want is some combination of the two where I can get similar characters and similar overall images (basically I selfishly want to have my cake and eat it too).
After some experimenting I found that I was able to get pretty close to that.
As I alluded to at the beginning of this post, getting better output from this pipeline basically comes down to modifying what data gets condensed down into that final feature vector that gets passed into annoy.
While in most things having your cake and eating it too isn't feasible, in this case it is! I would argue that this "new model" does a better job than the other two: it gets all the correct characters (beating the base image model) and displays images that have a closer "feel" to the base image than the headshot model.
Per usual this was just a situation where I had to attack the problem from a new point of view.
Still really enjoy Big Hero 6 and code to Immortals as a theme song for my life

I just had to rethink what information I was encoding into that final feature vector.
What I ended up doing was passing both the information from the detected headshot and the base image into annoy for a combined feature vector that captured information about both the character’s face (to get similar characters) and the base image (to get the overall “feel”).
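The simplest version of that idea is just concatenating the two feature vectors before indexing. This is only a sketch of the concept (the post notes the final model wasn't quite this straightforward, and the followup covers the real approach); the function name and the optional face weight are my inventions.

```python
import numpy as np

def combined_vector(face_vec, image_vec, face_weight=1.0):
    """Concatenate face-crop features and whole-image features into one
    vector for annoy; face_weight scales how much the character's face
    dominates the similarity relative to the overall image 'feel'."""
    return np.concatenate([face_weight * np.asarray(face_vec, dtype=float),
                           np.asarray(image_vec, dtype=float)])
```

With two 2048-dim ResNet feature vectors this would produce a 4096-dim combined vector, and the annoy index would be built with that dimension instead.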
However, this final model wasn't quite that straightforward and took me a bit to figure out, so I will write a followup post on it to keep this one to a reasonable length.
So tune in next time where I will walk through how I combined the face and base image information into a dense representation to let Spotify’s annoy find similar images in terms of character and feel.
Once again, feel free to check out the notebooks I used for this here.
The notebooks aren't super readable since I was hacking through things pretty quickly.
I also don’t provide model/dataset files in the repo per usual.