Let’s check three possible thresholds, including 0.5 and 0.75:

The fraction of ground truth footprints correctly identified (purple), identified at too low of an IoU score (orange), or missed completely (red) at three different IoU thresholds and stratified by look angle. Some competitors showed more dramatic performance drops than others as the threshold increased.

As you can see, each competitor’s correct identification rate (purple) drops as the threshold is increased; however, some drop more than others based on how well each prediction matched its respective building.
The orange portion of the bars represents buildings that were identified, but “not well enough”: with an IoU score greater than zero but less than the threshold.
The red bars represent buildings that were missed completely — an IoU of 0 — a population which, of course, is unaffected by IoU threshold.
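The three categories can be sketched in a few lines. This is a minimal illustration, not the scoring code used in the challenge: the function names and the IoU values are made up, and it assumes each ground truth footprint has already been assigned its best IoU against any prediction.

```python
from collections import Counter


def classify_footprint(best_iou, threshold):
    """Label one ground truth footprint from its best IoU with any prediction."""
    if best_iou >= threshold:
        return "correct"   # purple: identified well enough
    if best_iou > 0:
        return "low_iou"   # orange: identified, but below the threshold
    return "missed"        # red: no overlapping prediction at all


def outcome_fractions(best_ious, threshold):
    """Fraction of footprints falling in each category at one threshold."""
    counts = Counter(classify_footprint(iou, threshold) for iou in best_ious)
    total = len(best_ious)
    return {key: counts[key] / total for key in ("correct", "low_iou", "missed")}


# Hypothetical best-IoU values for eight footprints (illustrative only).
best_ious = [0.9, 0.6, 0.3, 0.0, 0.8, 0.0, 0.55, 0.2]
for threshold in (0.5, 0.75):
    print(threshold, outcome_fractions(best_ious, threshold))
```

Raising the threshold moves footprints from "correct" to "low_iou", while the "missed" fraction stays fixed because an IoU of 0 fails every threshold.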

Performance by look angle

Next, let’s examine how each competitor’s algorithm performed at each look angle.
We’ll look at three performance metrics: recall (the fraction of actual buildings identified), precision (the fraction of predicted buildings that corresponded to real buildings, not false positives), and F1 score, the competition metric that combines the two:

F1 score, recall, and precision for the top five competitors stratified by look angle. Though F1 scores and recall are relatively tightly packed except in the most off-nadir look angles, precision varied dramatically among competitors.

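For reference, the three metrics follow the standard definitions based on true positives, false positives, and false negatives. The sketch below uses a hypothetical helper name and made-up counts, not competition results.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and F1 from detection counts."""
    predicted = true_pos + false_pos   # everything the algorithm predicted
    actual = true_pos + false_neg      # everything in the ground truth
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Illustrative counts: 60 correct detections, 20 false positives, 40 misses.
precision, recall, f1 = precision_recall_f1(60, 20, 40)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a competitor cannot compensate for poor precision with high recall alone.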
Unsurprisingly, the competitors had very similar performance in these graphs, consistent with their tight packing at the top of the leaderboard.
Most notable is where this separation arose: the competitors were very tightly packed in the “nadir” range (0–25 degrees).
Indeed, the only look angles with substantial separation between the top two (cannab and selim_sef) were those >45 degrees.
cannab seems to have won on his algorithm’s performance on very off-nadir imagery!

An interesting takeaway from the bottom two graphs is that competitors had a bigger separation in their precision than in their recall, meaning that there was more variation in false positive rates than false negative rates.
With the exception of a substantial drop-off by selim_sef’s algorithm in the most off-nadir imagery, the five competitors’ recall scores were almost identical throughout the range of look angles.
By contrast, selim_sef had markedly better precision in the very off-nadir images than any other competitor, though cannab also clearly beat the other three prize-winners in this metric.
cannab and selim_sef were the only two competitors to use gradient boosting machines to filter false positives out of their predictions, which likely gave them the upper hand in precision.
One final note from these graphs: there are some odd spiking patterns in the middle look angle ranges.
The angles with lower scores correspond to images taken facing South, where shadows obscure many features, whereas North-facing images had brighter sunlight reflections off of buildings:

Two looks at the same buildings at nearly the same look angle, but from different sides of the city. It’s visually much harder to see buildings in the South-facing imagery, and apparently the same is true for neural nets!

This pattern was even stronger in our baseline model. Look angle isn’t all that matters: look direction is also important!

Seeing how similar these patterns are, we next asked how similar the competitors’ predictions are. Did they identify the same buildings and make the same mistakes, or did different algorithms have different success/failure patterns?

Similarity between winning algorithms

We examined each building in the imagery and asked how many competitors successfully identified it. The results were striking:

Histograms showing how many competitors identified each building in the dataset, stratified by look angle subset.
The vast majority of buildings were identified by all or none of the top five algorithms; very few were identified by only some of the top five.

Over 80% of buildings were identified by either zero or all five competitors in the nadir and off-nadir bins! This means that the algorithms only differed in their ability to identify about 20% of the buildings.
The algorithms differed more in the very off-nadir range, but even there only 30% of buildings were found by some, but not all, of the competitors.
Given the substantial difference in computing time needed to train and generate predictions from the different algorithms, we found this notable.
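The per-building count behind those histograms can be sketched as follows. The algorithm names, building IDs, and function name are hypothetical, and the sketch assumes each competitor’s output has already been reduced to the set of ground truth building IDs it successfully identified.

```python
from collections import Counter


def detection_histogram(detections, all_buildings):
    """Map 'number of algorithms that found a building' to a building count."""
    per_building = Counter()
    for found in detections.values():
        # Count one vote per algorithm for every building it identified.
        per_building.update(found & all_buildings)
    hist = Counter(per_building[b] for b in all_buildings)
    return {n: hist[n] for n in range(len(detections) + 1)}


# Hypothetical example: three algorithms, six ground truth buildings.
all_buildings = {1, 2, 3, 4, 5, 6}
detections = {
    "algo_a": {1, 2, 3, 4},
    "algo_b": {1, 2, 3, 5},
    "algo_c": {1, 2, 3},
}
hist = detection_histogram(detections, all_buildings)
all_or_none = (hist[0] + hist[len(detections)]) / len(all_buildings)
print(hist, all_or_none)
```

The `all_or_none` fraction corresponds to the share of buildings found by every algorithm or by none of them, which is the quantity quoted above for each look angle bin.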
Another way to explore how similar the algorithms are to one another is to measure the Jaccard similarity between their predictions.
For each pair of algorithms, we counted how many buildings both identified, then divided that by the number of buildings that at least one found (that is, the IoU of their prediction sets):

The Jaccard similarity between competitors’ prediction sets. Higher scores mean that competitors predicted more of the same buildings, and fewer different ones. Though the similarity decreased as look angle increased, the scores all remained very high: the lowest similarity score was 0.7, between XD_XD and number13’s algorithms in the very off-nadir images (>40 degrees off-nadir).

Note the scale bar on the right: the lowest Jaccard similarity between any pair of prediction sets was 0.7, and that only occurred between number13 and XD_XD in the very off-nadir angle subset.
No two algorithms generated predictions that were less than 80% similar by this metric when considering the entire dataset.
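As a rough illustration of this metric, the Jaccard similarity of two prediction sets is the size of their intersection divided by the size of their union. The algorithm names and building IDs below are invented for the example.

```python
from itertools import combinations


def jaccard(found_a, found_b):
    """Buildings found by both, divided by buildings found by at least one."""
    union = found_a | found_b
    if not union:
        return 1.0  # assumption: two empty prediction sets count as identical
    return len(found_a & found_b) / len(union)


# Hypothetical sets of identified building IDs for three algorithms.
predictions = {
    "algo_a": {1, 2, 3, 4},
    "algo_b": {1, 2, 3, 5},
    "algo_c": {1, 2, 3},
}
for (name_a, set_a), (name_b, set_b) in combinations(predictions.items(), 2):
    print(name_a, name_b, round(jaccard(set_a, set_b), 2))
```

A score of 1 means two algorithms found exactly the same buildings; a score of 0 means they had none in common.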

False positives: where did algorithms fail?

As the algorithms’ correct predictions were very similar, we next asked if the same was true of their false positive predictions, that is, places where algorithms incorrectly predicted buildings in the images.
We split the false positives into two sets: all predictions that did not satisfy the IoU threshold of 0.5, and the subset that did not overlap with an actual building at all (an IoU of zero).
We then went through every competitor’s false positive predictions, counting how many of them overlapped between pairs of competitors.
We used the same Jaccard metric to quantify their similarity:

The Jaccard similarity between false positive predictions, that is, the fraction of false positives that overlapped with false positives from another competitor. The IoU threshold for an overlap between false positives was set to >0: any overlap between two false positive predictions was counted, with no threshold. The top panel represents all false positive predictions (IoU < 0.5 with buildings in the actual dataset), whereas the bottom panel represents predictions that did not overlap with actual buildings at all.

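One way to picture the "any overlap counts" rule is sketched below. This is a simplification under stated assumptions, not the analysis code: footprints are reduced to axis-aligned boxes `(xmin, ymin, xmax, ymax)` rather than polygons, the function names and coordinates are made up, and the score is the one-directional fraction of one competitor’s false positives that touch any false positive from the other.

```python
def boxes_overlap(box_a, box_b):
    """True if two (xmin, ymin, xmax, ymax) boxes share any positive area."""
    return (
        box_a[0] < box_b[2] and box_b[0] < box_a[2]
        and box_a[1] < box_b[3] and box_b[1] < box_a[3]
    )


def overlap_fraction(fps_a, fps_b):
    """Fraction of A's false positives overlapping at least one of B's."""
    if not fps_a:
        return 0.0
    hits = sum(any(boxes_overlap(a, b) for b in fps_b) for a in fps_a)
    return hits / len(fps_a)


# Hypothetical false positive boxes from two algorithms.
fps_a = [(0, 0, 2, 2), (10, 10, 12, 12), (20, 20, 22, 22)]
fps_b = [(1, 1, 3, 3), (11, 9, 13, 11), (50, 50, 52, 52)]
print(overlap_fraction(fps_a, fps_b))
```

With real polygon footprints, the overlap test would need a geometry library in place of the box comparison, but the counting logic would stay the same.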
Though there was more variability in false positives than in correct predictions, we still found these results intriguing: the five different algorithms, comprising different neural net architectures, loss functions, augmentation strategies, and inputs, often generated very similar incorrect predictions.
For example, cannab and selim_sef’s IoU = 0 false positives (those that did not overlap with an actual building at all) overlapped with one another over 95% of the time!

After checking to ensure that these false positives did not correspond to actual buildings that were missed by manual labelers (a few did, but under 10% of the total), we found some interesting examples to show here:

Examples of false positive predictions from cannab (red) and selim_sef (blue). The purple overlaps at the top and bottom right represent predicted buildings where there were none: at the top, where there was a tree between two sheds, and in the bottom right, in an especially dark shadow from a tree. Note that these false positives were very rare!

Another example of false positives from cannab (red) and selim_sef (blue). They both predicted that the dilapidated foundation in between these two houses was an actual building.

Now that we’ve explored dataset-wide statistics, let’s drill down to what made some buildings easier or harder to find.
For this portion, we’ll focus on cannab’s winning algorithm.

Building size

The size of building footprints in this dataset varied dramatically. We scored competitors on their ability to identify everything larger than 20 square meters in extent, but did competitors perform equally well through the whole range? The graph below answers that question.

Building recall (y axis) stratified by building footprint size (x axis). The blue, orange, and green lines represent the fraction of building footprints of a given size that were identified. The red line denotes the number of building footprints of that size in the dataset (right y axis).

Even the best algorithm performed relatively poorly on small buildings.
cannab identified only about 20% of buildings smaller than 40 square meters, even in images with look angle under 25 degrees off-nadir.
This algorithm achieved its peak performance on buildings over 105 square meters in extent, but this only corresponded to about half of the objects in the dataset.
It is notable, though, that this algorithm correctly identified about 90% of buildings with footprints larger than 105 square meters in nadir imagery.
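Stratifying recall by footprint size amounts to binning buildings by area and computing the identified fraction per bin. The sketch below is illustrative only: the function name and sample buildings are made up, and the bin edges simply reuse the 20, 40, and 105 square meter figures mentioned in the text rather than the bins used for the graph.

```python
from bisect import bisect_right


def recall_by_size(buildings, bin_edges):
    """Recall per size bin; bin_edges are ascending lower bounds in m^2."""
    found = [0] * len(bin_edges)
    total = [0] * len(bin_edges)
    for area, identified in buildings:
        idx = bisect_right(bin_edges, area) - 1
        if idx < 0:
            continue  # smaller than the first edge: not scored
        total[idx] += 1
        found[idx] += int(identified)
    return {
        edge: (found[i] / total[i] if total[i] else None)
        for i, edge in enumerate(bin_edges)
    }


# Hypothetical (area in m^2, was_identified) pairs.
buildings = [
    (10, True),  # below the 20 m^2 cutoff, ignored
    (25, False), (30, False), (35, True), (38, False), (39, False),
    (50, True), (80, False),
    (120, True), (300, True),
]
print(recall_by_size(buildings, [20, 40, 105]))
```

Each key in the result is the lower bound of a bin, so `20` covers areas from 20 up to (but not including) 40 square meters, and the last bin is open-ended.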

Occluded buildings

For the first time in the history of the SpaceNet Dataset, we included labels identifying buildings occluded by trees. This allows us to explore building detection in the densely treed suburbs surrounding Atlanta (see the image at the beginning of this post). So, how well did the best algorithm do at identifying buildings partially blocked from the satellite by overhanging trees?

cannab’s algorithm showed a small but appreciable drop in performance when measuring its ability to segment buildings occluded by trees. X axis, look angle of the image; y axis, recall.

cannab’s algorithm only showed a small drop in performance for occluded buildings.
This is encouraging: it indicates that algorithms can learn to work around occlusions to find unusually shaped subgroups of a class, still classifying their footprints correctly.

Conclusion

The top five competitors solved this challenge very well, achieving excellent recall and relatively few false positive predictions.
Though their neural net architectures varied, their solutions generated strikingly similar predictions, emphasizing that advancements in neural net architectures have diminishing returns for building footprint extraction and similar tasks.
Object size can be a significant limitation for segmentation in overhead imagery, and look angle and direction dramatically alter performance.
Finally, much more can be learned from examining the winning competitors’ code on GitHub and their descriptions of their solutions, and we encourage you to explore their solutions more!

What’s next?

This brings us to the end of the SpaceNet Challenge Round 4: Off-Nadir Building Detection.
Thank you for reading and we hope you learned as much as we did.
Follow The DownlinQ on Medium and the authors on Twitter @CosmiQWorks and @NickWeir09 for updates on the next SpaceNet Challenge coming soon!