A Gold-Winning Solution Review of Kaggle Humpback Whale Identification Challenge

An extensive yet simple review of the most noticeable approaches

Vladislav Shakhray, Mar 4

Photo by Cristina Mittermeier

Recently, my team took part in the Humpback Whale Identification Challenge hosted on Kaggle.

We won a gold medal and were placed at #10 (out of 2131 teams) on the leaderboard.

In this blog post, I will summarise the main ideas of our solution, as well as provide a brief overview of interesting and catchy methods used by other teams.

Problem Description

The main goal was to identify whether a given photo of a whale fluke belongs to one of the 5004 known whale individuals or is a new_whale, never observed before.

Example of 9 photos of the same whale from the training data

The puzzling aspect of this competition was a huge class imbalance.

For more than 2000 classes there was only one training sample, which made it hard to use the classification approach out of the box.

What’s more, an important part of the competition was to classify whether the whale is new or not, which turned out to be quite non-trivial.

Class Imbalance, from kernel by Tomasz Bartczak

The metric for the competition was mAP@5 (mean Average Precision at 5), which allowed us to submit up to 5 predictions for each test image.

Our highest result on the private test set was 0.959 mAP@5.
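To make the metric concrete, here is a minimal sketch of mAP@5 for this setting. Because every test image has exactly one correct label, the per-image score reduces to 1/rank of the correct label among the (up to) five guesses, or 0 if it is missing. The function names and the toy labels are illustrative, not taken from the competition code.

```python
def average_precision_at_5(true_label, predictions):
    """Score one image: 1/rank of the first correct guess among the top 5, else 0."""
    for rank, label in enumerate(predictions[:5], start=1):
        if label == true_label:
            return 1.0 / rank
    return 0.0


def map_at_5(true_labels, all_predictions):
    """Mean of the per-image scores over the whole set."""
    scores = [average_precision_at_5(t, p)
              for t, p in zip(true_labels, all_predictions)]
    return sum(scores) / len(scores)


truth = ["w_1", "new_whale", "w_2"]
preds = [
    ["w_1", "new_whale", "w_3"],          # correct at rank 1 -> 1.0
    ["w_5", "new_whale", "w_1"],          # correct at rank 2 -> 0.5
    ["w_9", "w_8", "w_7", "w_6", "w_5"],  # never correct     -> 0.0
]
print(map_at_5(truth, preds))  # 0.5
```

The ordering of the five guesses matters a lot: moving the correct label from first to second place halves the score for that image.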

“Sanakoyeu, Pleskov, Shakhray”

The team consisted of Vladislav Shakhray (me), Artsiom Sanakoyeu, Ph.D. student at Heidelberg University, and Pavel Pleskov, Kaggle Top-5 Grandmaster.

Artsiom and I joined forces in the middle of the competition to speed up the experiments, and Pavel joined us one week before the team merger deadline.

Validation and Initial Setting

A couple of months before this competition, a playground version of the same competition was hosted on Kaggle but, as the competition hosts noted, the real (non-playground) version featured even more data and cleaner labels.

We decided to utilize the knowledge and the data from the previous competition in numerous ways:

Using the data from the previous competition, we used image hashing to collect over 2000 validation samples.

This proved to be crucial when we validated our ensembles later.

We removed the new_whale class from the training dataset, since its images do not share any common visual features with one another.

Some images were not aligned at all.

Luckily, there was a publicly available pre-trained bounding-box model used in a winning solution of the playground competition.

We used it to detect the precise bounding box around the whale fluke and cropped the images accordingly.

Due to the different color gammas of the images, all data was converted to grayscale prior to training.
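The image-hashing step mentioned above can be illustrated with a small sketch. The post does not say which hashing scheme was used, so the code below assumes a simple average hash: shrink the grayscale image to an 8×8 grid of block means and record which blocks are brighter than the overall mean. `average_hash` and `hamming` are hypothetical helper names, and the random arrays stand in for real photos.

```python
import numpy as np


def average_hash(gray, hash_size=8):
    """Return a flat boolean hash of a 2-D grayscale array (assumed scheme)."""
    h, w = gray.shape
    # Crop so the image divides evenly into hash_size x hash_size blocks.
    gray = gray[: h - h % hash_size, : w - w % hash_size].astype(np.float64)
    blocks = gray.reshape(hash_size, gray.shape[0] // hash_size,
                          hash_size, gray.shape[1] // hash_size)
    small = blocks.mean(axis=(1, 3))        # block-averaged thumbnail
    return (small > small.mean()).ravel()   # one bit per block


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(a != b))


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 96))
brighter = np.clip(image + 10, 0, 255)      # same photo, slightly re-exposed
other = rng.integers(0, 256, size=(64, 96))

print(hamming(average_hash(image), average_hash(brighter)))  # expected to be small
print(hamming(average_hash(image), average_hash(other)))     # expected to be large
```

Images whose hashes differ in only a few bits are likely to be the same photo, which is one way labelled images from an older dataset could be matched against a newer one.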

Approach #1: Siamese Networks (Vladislav)

Our first architecture was a siamese network with numerous branch architectures and a custom loss, which consisted of a number of convolutional and dense layers.

The branch architectures that we used included:

ResNet-18, ResNet-34, ResNet-50

SE-ResNeXt-50

ResNet-like custom branch shared publicly by Martin Piotte

We used hard-negative as well as hard-positive mining by solving the Linear Assignment Problem on the matrix of scores every 4 epochs.

A small amount of randomization was added to the matrix to ease the training process.
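As a rough illustration of the mining step, the sketch below pairs every image with a different-whale image that the model currently scores as similar, by solving a linear assignment problem with SciPy's `linear_sum_assignment`. The exact cost construction, the noise scale, and the function name `mine_hard_negatives` are assumptions made for this example; the post only states that an assignment problem was solved on a slightly randomized score matrix.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def mine_hard_negatives(scores, labels, noise=0.1, seed=0):
    """Pair every image with a different-whale image the model finds confusing.

    scores[i, j] is the model's similarity for images i and j (higher = more alike).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Low cost where the model is most wrong about a non-matching pair.
    cost = -np.asarray(scores, dtype=np.float64)
    # Random jitter so the same pairs are not selected every time.
    cost += noise * rng.random(cost.shape)
    # Forbid same-whale pairs (including an image paired with itself).
    same = labels[:, None] == labels[None, :]
    cost[same] = 1e6
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))


labels = [0, 0, 1, 1, 2, 2]
scores = np.random.default_rng(1).random((6, 6))
for i, j in mine_hard_negatives(scores, labels):
    print(f"image {i} (whale {labels[i]}) <-> image {j} (whale {labels[j]})")
```

Solving an assignment problem, rather than taking each row's maximum, uses every image exactly once as a partner, so a few very confusing images do not dominate the mined pairs.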

Progressive learning was used, with the resolution strategy 229×229 -> 384×384 -> 512×512.

That is, we first trained our network on 229×229 images with little regularization and a larger learning rate.

After convergence, we reset the learning rate and increased regularization, then trained the network again on images of higher resolution.

Furthermore, due to the nature of data, heavy augmentations were used that included random brightness, Gaussian noise, random crops, and random blur.

In addition, we pursued a smart flipping augmentation strategy that significantly helped to create more training data.

Specifically, for every pair of training images (X, Y) belonging to the same whale, we created one more training pair (flip(X), flip(Y)).

On the other hand, for every pair of different whales, we created three more examples: (flip(X), Y), (Y, flip(X)) and (flip(X), flip(Y)).

An example showing that a random-flip strategy does not work with a pair of same-whale photos.

Notice how the lower photos become different when we flip one of the images since we care about fluke orientation.
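The pair-flipping rule described above can be written down in a few lines. This is a minimal sketch with images represented as NumPy arrays; `flip` and `expand_pair` are illustrative names, and the label convention (1 for same whale, 0 for different whales) is an assumption.

```python
import numpy as np


def flip(image):
    """Mirror an image left-to-right."""
    return image[:, ::-1]


def expand_pair(x, y, same_whale):
    """Return the original pair plus the flipped pairs, each with its label."""
    if same_whale:
        # Flipping both photos keeps them a matching pair.
        pairs = [(x, y), (flip(x), flip(y))]
        return [(a, b, 1) for a, b in pairs]
    # For different whales, every flipped combination is still a non-match.
    pairs = [(x, y), (flip(x), y), (y, flip(x)), (flip(x), flip(y))]
    return [(a, b, 0) for a, b in pairs]


x = np.arange(6).reshape(2, 3)
y = x + 10
print(len(expand_pair(x, y, same_whale=True)))   # 2
print(len(expand_pair(x, y, same_whale=False)))  # 4
```

Flipping only one image of a same-whale pair is deliberately left out, since a mirrored fluke no longer matches its unmirrored counterpart.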

The models were optimized using the Adam optimizer with an initial learning rate of 1e-4, reduced by a factor of 5 on plateau.

The batch size was set to 64.

The source for the models was written in Keras.

It took 2–3 days (depending on image resolution) to train the models for about 400-600 epochs on a single 2080Ti.

The best-performing single model with ResNet-50 scored 0.929 LB.

Approach #2: Metric Learning (Artsiom)

Another approach that we used was metric learning with Margin Loss.

We used numerous ImageNet-pretrained backbone architectures, which included:

ResNet-50, ResNet-101, ResNet-152

DenseNet-121, DenseNet-169

The networks were trained progressively, mostly using a 448×448 -> 672×672 strategy.

We used the Adam optimizer, decreasing the learning rate by a factor of 10 after 100 epochs.

We also used a batch size of 96 for the whole training.

The most interesting part is what gave us a 2% boost right away.

It is a metric learning method that was developed by Sanakoyeu, Tschernezki, et al. and was accepted for publication at CVPR 2019.

Every n epochs, it splits the training data as well as the embedding layer into k clusters.

After setting up the bijection between the training chunks and the learners, the model trains them separately while accumulating the gradients for the branch network.

You can check out this paper along with the code when it is published here.

“Divide and Conquer the Embedding Space for Metric Learning”, Artsiom Sanakoyeu, Vadim Tschernezki, Uta Büchler, Björn Ommer, In CVPR 2019

Because of the huge class imbalance, heavy augmentations were used, which included random flips, rotation, zoom, blur, lighting, contrast, and saturation changes.

During inference, dot products between the query feature vector and the train gallery feature vectors were calculated and a class with the highest dot product value was selected as the TOP-1 prediction.

Another trick that implicitly helped with the class imbalance was averaging the feature vectors for the train images belonging to the same whale ids.
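The inference procedure just described (average the train embeddings per whale id, then rank classes by dot product with the query) can be sketched as follows. The function names and the tiny 2-D embeddings are illustrative; whether the embeddings were L2-normalized beforehand is not stated in the post, so no normalization is applied here.

```python
import numpy as np


def class_prototypes(train_embeddings, train_labels):
    """Average the embeddings of all training images that share a whale id."""
    labels = np.asarray(train_labels)
    classes = np.unique(labels)
    protos = np.stack([train_embeddings[labels == c].mean(axis=0)
                       for c in classes])
    return classes, protos


def top_k(query_embeddings, classes, protos, k=5):
    """Rank classes for each query by dot product with the class prototype."""
    sims = query_embeddings @ protos.T          # (n_query, n_classes)
    order = np.argsort(-sims, axis=1)[:, :k]    # best class first
    return classes[order]


train = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]])
classes, protos = class_prototypes(train, ["a", "a", "b"])
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
print(top_k(queries, classes, protos, k=1))
```

Averaging gives each whale a single gallery vector regardless of how many photos it has, which is why it helps with the imbalance: frequent whales no longer get more chances to match a query than rare ones.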

The models were implemented using PyTorch and took 2–4 days (depending on the image resolution) to train on a single Titan Xp.

Notably, the best-performing single model with a DenseNet-169 backbone scored 0.931 LB.

Approach #3: Classification on Features (Artsiom)

When Artsiom and I joined forces, one of the first things we did was train a classification model on the features extracted from all our models and concatenated together (after applying PCA, of course).

The head for the classification consisted of two dense layers with dropout in between.

The model trained very quickly because we used precomputed features.

This approach allowed us to get 0.924 LB and brought even more diversity to the overall ensemble.

Approach #4: New Whale Classification (Pavel)

One of the most complicated parts of this competition was to correctly classify the new whales (as about 30% of all images belonged to the new_whale class).

The popular strategy to deal with this was to use a simple threshold.

That is, if the maximum probability that the given image X belongs to some known whale class is smaller than the threshold, the image is classified as new_whale.
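A minimal sketch of the threshold strategy is shown below. It treats new_whale as a pseudo-class whose score equals the threshold, so it lands first when no known whale exceeds the threshold and further down the top-5 list otherwise. Placing new_whale inside the ranked list this way is an assumption about how the threshold is typically applied to a five-guess submission; the function name and the example probabilities are illustrative.

```python
import numpy as np


def predict_with_threshold(probs, class_names, threshold, k=5):
    """Top-k labels per image, inserting new_whale where the threshold falls."""
    results = []
    for row in probs:
        labels = []
        for idx in np.argsort(-row):            # known classes, best first
            if "new_whale" not in labels and row[idx] < threshold:
                labels.append("new_whale")
            labels.append(class_names[idx])
        if "new_whale" not in labels:           # every class beat the threshold
            labels.append("new_whale")
        results.append(labels[:k])
    return results


probs = np.array([[0.9, 0.3, 0.1],
                  [0.2, 0.4, 0.1]])
names = ["w_a", "w_b", "w_c"]
print(predict_with_threshold(probs, names, threshold=0.5))
# [['w_a', 'new_whale', 'w_b', 'w_c'], ['new_whale', 'w_b', 'w_a', 'w_c']]
```

The threshold itself would be tuned on a validation set, for example by picking the value that maximizes mAP@5.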

However, we thought that there may be better ways for tackling the problem.

For each best-performing model and ensemble, we took its TOP-4 predictions, sorted in descending order.

Then, for every other model, we took their probabilities for the selected 4 classes.

The goal was to predict whether the whale is new or not based on these features.
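The feature construction described above might look roughly like the sketch below: take one reference model's TOP-4 probabilities, then look up what every other model assigns to those same four classes, and concatenate everything into one row per image. The function name, the array shapes, and the choice of a single reference model per call are assumptions for illustration.

```python
import numpy as np


def new_whale_features(reference_probs, other_probs, top=4):
    """Build one feature row per image from several models' class probabilities.

    reference_probs: (n_images, n_classes) scores from the reference model.
    other_probs: list of arrays of the same shape from the remaining models.
    """
    # Indices of the reference model's top classes, best first.
    top_idx = np.argsort(-reference_probs, axis=1)[:, :top]
    blocks = [np.take_along_axis(reference_probs, top_idx, axis=1)]
    for probs in other_probs:
        # Each other model's scores for those same classes.
        blocks.append(np.take_along_axis(probs, top_idx, axis=1))
    return np.hstack(blocks)


ref = np.array([[0.1, 0.6, 0.3]])
other = np.array([[0.5, 0.2, 0.3]])
print(new_whale_features(ref, [other], top=2))  # [[0.6 0.3 0.2 0.3]]
```

Rows like these, labelled by whether the image is truly a new whale, can then be fed to any binary classifier.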

Pavel created a very powerful blend of logistic regression, SVM, several k-NN models, and LightGBM.

The combination of all of them gave us 0.9655 ROC-AUC on cross-validation and increased the LB score by 2%.

Ensembling

Constructing the ensembles out of our models definitely wasn’t a piece of cake.

The thing is that my models’ output was a matrix of unnormalized probabilities (from 0 to 1), while the output matrices provided by Artsiom consisted of Euclidean distances (thus ranging from 0 to infinity).

We tried numerous methods to transform Artsiom’s matrices to probabilities, which included:

A t-SNE-like transformation

Softmax

Simply reversing the range by applying the function 1 / (1 + distances)

Lots of other functions to reverse the matrices’ range

Unfortunately, the first two methods didn't work at all, while almost any function that mapped the range to [0, 1] gave approximately the same result.

We ended up choosing the function with the highest mAP@5 on the validation set.

Surprisingly, the best one was 1 / (1 + log(1 + log(1 + distances))).
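For reference, the chosen transformation is easy to express with NumPy. The sketch below only implements the formula quoted above; the function name and the sample distances are illustrative.

```python
import numpy as np


def distances_to_scores(distances):
    """Map distances in [0, inf) to scores in (0, 1]; smaller distance -> larger score."""
    d = np.asarray(distances, dtype=np.float64)
    # log1p(x) computes log(1 + x), so this is 1 / (1 + log(1 + log(1 + d))).
    return 1.0 / (1.0 + np.log1p(np.log1p(d)))


print(distances_to_scores([0.0, 1.0, 10.0, 1000.0]))
```

The double logarithm compresses large distances very strongly, so the resulting scores decay slowly and stay on a scale comparable to the other models' outputs.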

Methods Used by Other Teams

SIFT-Based

I would like to outline one solution which was, from my point of view, one of the most beautiful and, at the same time, unusual.

David, now a Kaggle Grandmaster (Rank 12), was 4th on the Private LB and shared his solution as a post on the Kaggle Discussions forum.

He worked with full-resolution images and used traditional keypoint matching techniques, utilizing SIFT and ROOTSIFT.
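For readers unfamiliar with RootSIFT: in its usual definition it is a small post-processing step on ordinary SIFT descriptors, where each descriptor is L1-normalized and then square-rooted element-wise. The sketch below shows only that generic transformation on made-up descriptors; it is not taken from David's solution, and the function name is illustrative.

```python
import numpy as np


def root_sift(descriptors, eps=1e-7):
    """Convert SIFT descriptors (one per row, non-negative) to RootSIFT."""
    d = np.asarray(descriptors, dtype=np.float64)
    d = d / (np.abs(d).sum(axis=1, keepdims=True) + eps)  # L1-normalize each row
    return np.sqrt(d)                                     # element-wise square root


fake_descriptors = np.array([[1.0, 3.0], [2.0, 2.0]])     # stand-ins for 128-D SIFT
print(root_sift(fake_descriptors))
```

After this step, comparing descriptors with Euclidean distance corresponds to comparing the original histograms with the Hellinger kernel, which tends to work better for histogram-like features.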

In order to deal with false positives, David trained a U-Net to segment the whale from the background.

Interestingly, he used smart post-processing to give classes with only one training example more chance to be in the TOP-1 prediction.

We also thought about trying SIFT-based methods, but we were convinced that they would perform worse than the top-notch neural networks.

The takeaway, in my opinion, is that we should never let the power of deep learning blind us into underestimating the abilities of traditional methods.

Pure Classification

The team Pure Magic thanks radek (7th place), consisting of Dmytro Mishkin, Anastasiia Mishchuk and Igor Krashenyi, pursued an approach that combined metric learning (triplet loss) and classification, as Dmytro described in his post.

They tried Center Loss to reduce overfitting when training classification models for a long time, along with temperature scaling before applying softmax.
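Temperature scaling, mentioned above, simply divides the logits by a constant before the softmax. The sketch below shows the generic operation; the temperature values are illustrative, and the values that team actually used are not given here.

```python
import numpy as np


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over the last axis after dividing the logits by a temperature."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, temperature=1.0))
print(softmax_with_temperature(logits, temperature=5.0))  # flatter distribution
```

A temperature above 1 flattens the distribution and a temperature below 1 sharpens it, while the ranking of the classes stays the same.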

Among the numerous backbone architectures that were used, the best one was SE-ResNeXt-50, which was able to reach 0.955 LB.

Their solution is way more diverse than that, and I highly recommend referring to the original post.

CosFace, ArcFace

As mentioned in the post by Ivan Sosin (his team BratanNet was 9th in this competition), they used the CosFace and ArcFace approaches.

From the original post:

Among others Cosface and Arcface stand out as newly discovered SOTA for face recognition task.

The main idea is to bring examples of the same class close to each other in cosine similarity space and to pull apart distinct classes.

Training with cosface or arcface generally is classification, so the final loss was CrossEntropy.
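To show what "classification with a margin in cosine space" means in practice, here is a small NumPy sketch of the CosFace and ArcFace logits followed by ordinary cross-entropy. The scale and margin values, the function names, and the toy inputs are illustrative assumptions, and edge-case handling found in real implementations (for example when the angle plus margin exceeds π) is omitted.

```python
import numpy as np


def margin_logits(embeddings, class_weights, labels,
                  scale=30.0, margin=0.35, kind="cosface"):
    """Scaled cosine logits with a margin applied to the true class only."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)                # (n_samples, n_classes)
    rows = np.arange(len(labels))
    target = cos[rows, labels]
    if kind == "cosface":
        target = target - margin                     # cos(theta) - m
    else:
        target = np.cos(np.arccos(target) + margin)  # cos(theta + m)
    out = cos.copy()
    out[rows, labels] = target
    return scale * out


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


emb = np.array([[1.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0])
print(cross_entropy(margin_logits(emb, weights, y, kind="cosface"), y))
print(cross_entropy(margin_logits(emb, weights, y, margin=0.5, kind="arcface"), y))
```

The margin makes the true class harder to score highly, so the network has to pull same-class embeddings closer together in angle than plain softmax would require.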

When using larger backbones like InceptionV3 or SE-ResNeXt-50, they noticed overfitting, so they switched to lighter networks like ResNet-34, BN-Inception and DenseNet-121.

The team also used carefully selected augmentations and numerous network modification techniques like CoordConv and GapNet.

What was particularly interesting in their approach is the way they dealt with new_whales.

From the original post:

Starting from the beginning we realised that it is essential to do something with new whales in order to incorporate them into the training process.

Simple solution was to assign each new whale a probability of each class equal to 1/5004.

With the help of weighted sampling technique it gave us some boost.

But then we realised that we could use softmax predictions for new whales derived from the trained ensemble.

So we came up with distillation.

We choose distillation instead of pseudo labels, because new whale is considered to have different labels from the train labels.

Though it might not really be true.

To further boost the model capability we added test images with pseudo labels into the train dataset.

Eventually, our single model could hit 0.958 with snapshot ensembling.

Unfortunately, ensembling trained this way didn’t give any score improvement.

Maybe it was due to less variety because of pseudo labels and distillation.

Final Thoughts

Final Standings

What is quite surprising is that there was almost no shake-up in the end, in spite of the fact that the private test set made up almost 80% of the whole test dataset.

I believe that the competition hosts did a great job of providing a very interesting problem, along with clean and processed data.

It was the very first Kaggle competition that I participated in, and it definitely proved how interesting, engaging, motivating and educational Kaggle competitions can be.

I want to congratulate the people who became Experts, Masters, and Grandmasters thanks to this competition.

I also want to thank the ODS.ai community for amazing discussions and support.

Finally, I want to especially thank my team members Artsiom Sanakoyeu and Pavel Pleskov once again for an unforgettable Kaggle competition experience.

