Nimbus — Cloud Segmentation using Deep Learning for Agriculture.

Gulfaraz Rahman, Jan 3

Nimbus is built to remove clouds and the shadows cast by these clouds in Sentinel-2 satellite images gathered from the Copernicus portal.

We consider clouds as noise that needs to be removed in order to monitor agricultural land.

Problem scope in the high-level overview.

The Problem

Find and remove clouds and their shadows on satellite images.

Formulated as a pattern recognition problem, we use deep learning on a small annotated set of satellite images to train a model to remove clouds and accompanying shadows.

Follow the link below to understand why we want to solve this problem:

Deep Learning for Agriculture. A framework for adapting deep learning techniques to leverage satellite data for large scale geo-analysis (justai.nl)

The Dataset

The training data consists of 21 manually labelled pairs of images with and without clouds and shadows.

The validation set consists of 7 similar pairs, and the test set of 8 pairs from the same geographical area plus 4 images from an unseen neighbouring area.

All images are of resolution 6000 x 5000, except those of the neighbouring area which are 10980 x 10980.

All images are captured from Sentinel-2, from which only four image channels are used — red, green, blue and near infrared.

Satellite image with clouds and shadows (left) / Satellite image with clouds and shadows removed (right).

The Solution

Magic! Well, almost.

We use Deep Learning with Convolutional Neural Networks, back-propagation and Stochastic Gradient Descent with the help of PyTorch.

This frees us from any need to understand how clouds cast shadows and we only focus on a training pipeline of the format input-model-target.

The input is the satellite image of a region covered with clouds.

The difference between the input image and the annotated image (input image with clouds removed) results in a mask which serves as the target for our model.

The mask is a binary image — with a value of 0 or 1 for each corresponding pixel of the input image.
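As a rough illustration of how such a target could be derived, the sketch below marks every pixel where the cloudy image and its annotated counterpart differ. The function name `difference_mask` and the `tol` parameter are hypothetical, and the exact rule used in Nimbus may differ.

```python
import numpy as np

def difference_mask(cloudy, cleaned, tol=0):
    """Return a binary mask that is 1 wherever the two images differ.

    cloudy, cleaned: arrays of shape (H, W, C) with the same dtype.
    tol: hypothetical tolerance for ignoring tiny differences.
    """
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(cloudy.astype(np.int64) - cleaned.astype(np.int64))
    # A pixel is masked if any channel changed by more than tol.
    return (diff.max(axis=-1) > tol).astype(np.uint8)

# Toy 4x4 image with 4 channels (red, green, blue, near infrared).
cloudy = np.full((4, 4, 4), 100, dtype=np.uint16)
cleaned = cloudy.copy()
cleaned[1:3, 1:3, :] = 0  # pretend the annotator blanked a 2x2 cloud
mask = difference_mask(cloudy, cleaned)
```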

We choose the U-Net architecture [1] for this binary segmentation task; it is similar to the SegNet architecture but adds shortcut connections from the encoder to the decoder.

SegNet segmentation using Deep Learning.

Training Pipeline

1. Bootstrap image pairs from the dataset

2. Add invariance using data augmentation

3. Training the deep network

4. Post-processing and model evaluation

Bootstrap image pairs from the dataset

With only 21 image pairs for training, our dataset is too small to perform any reasonable learning.

It also suffers from class imbalance, as most images have clouds in them which would lead the model to believe there will be clouds in every image.

The training image size and test image size differ but we would like our model to work well on all image sizes.

Each image is huge, so loading a whole image and then processing it takes an unreasonable amount of memory.

We tackle all of these issues with one technique — bootstrapping.

Benefits of bootstrapping.

Bootstrapping allows us to draw an effectively unlimited number of smaller images from the larger images, which transforms our small dataset into a very large one.

One image without clouds now yields many smaller cloud-free samples, which alleviates the impact of class imbalance.

We can fix the size of the sampled images which resolves the issue of varying image sizes.

By choosing a small size we can also avoid the problem of large images thus allowing us to load multiple images into memory for batch training.

Bootstrapping inflates our training data points from 21 to 210,000 by sampling 100,000 images of 32×32 size from each of the 21 training images.

Given enough samples, the class imbalance problem diminishes.
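A minimal sketch of this kind of sampling is shown below. It assumes patches are drawn uniformly at random from the image and its mask; the post does not say how the crops were actually positioned, and `sample_patches` is a hypothetical helper name.

```python
import numpy as np

def sample_patches(image, mask, patch_size, n_samples, rng):
    """Yield n_samples random (image_patch, mask_patch) pairs."""
    height, width = mask.shape
    for _ in range(n_samples):
        # Top-left corner chosen uniformly so the patch fits inside.
        top = rng.integers(0, height - patch_size + 1)
        left = rng.integers(0, width - patch_size + 1)
        rows = slice(top, top + patch_size)
        cols = slice(left, left + patch_size)
        yield image[rows, cols], mask[rows, cols]

rng = np.random.default_rng(0)
image = rng.random((500, 600, 4))  # small stand-in for a 4-channel scene
mask = (rng.random((500, 600)) > 0.5).astype(np.uint8)
patches = list(sample_patches(image, mask, patch_size=32, n_samples=10, rng=rng))
```

Because the crop is cut lazily from arrays that are already loaded, the number of samples per image is limited only by how long one is willing to train.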

Add invariance using data augmentation

Data augmentation is a common technique to inflate datasets by including rotated and mirrored copies of the original data points.

Using bootstrapping, we have resolved the size problem of the dataset but could still use the other benefits of data augmentation.

By training the model with rotated and mirrored variants we teach our model to be more robust to such changes.

A rotated cat is still a cat; our problem is to mask the cat, or in our case the cloud.

We train the model with combinations of 90/180/270-degree rotations with horizontal and vertical flips.
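The rotations and flips can be enumerated as in the sketch below. It assumes the full set of eight distinct variants is used (four rotations, each with and without a horizontal mirror; a vertical flip is the same as a 180-degree rotation followed by a horizontal flip). `dihedral_variants` is a hypothetical name.

```python
import numpy as np

def dihedral_variants(patch):
    """Return the 8 rotated/mirrored variants of a (H, W, ...) patch."""
    variants = []
    for k in range(4):  # 0, 90, 180 and 270 degrees
        rotated = np.rot90(patch, k, axes=(0, 1))
        variants.append(rotated)
        variants.append(np.flip(rotated, axis=1))  # horizontal mirror
    return variants

patch = np.arange(16).reshape(4, 4)
variants = dihedral_variants(patch)
```

During training, the same transform has to be applied to the input patch and to its target mask so that the two stay aligned.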

Training the deep network

The input images are passed through the network, which performs convolutions and transformations to produce a mask with a value between 0 and 1 for each pixel (using a sigmoid).

The generated mask is compared with the target mask to estimate a difference or loss.

We train and compare models using three different loss functions: Binary Cross Entropy (basic), Lovasz Hinge (Jaccard approximation) and Iglovikov (= BCE – log jaccard_approx) [2].
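To make the combined loss concrete, here is a NumPy sketch of binary cross entropy and of BCE minus the log of a soft Jaccard approximation. The smoothing constant `EPS` and the function names are assumptions of this sketch, not the training code; in practice the same expressions would be written with PyTorch tensors so that gradients can flow. The Lovasz hinge is more involved and is not shown.

```python
import numpy as np

EPS = 1e-7  # numerical guard, an assumption of this sketch

def bce(pred, target):
    """Mean binary cross entropy for probabilities in (0, 1)."""
    pred = np.clip(pred, EPS, 1 - EPS)
    return float(-np.mean(target * np.log(pred)
                          + (1 - target) * np.log(1 - pred)))

def soft_jaccard(pred, target):
    """Differentiable Jaccard approximation computed on probabilities."""
    intersection = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - intersection
    return float((intersection + EPS) / (union + EPS))

def bce_minus_log_jaccard(pred, target):
    """BCE - log(soft Jaccard): low when the masks overlap well."""
    return bce(pred, target) - float(np.log(soft_jaccard(pred, target)))

target = np.array([[1.0, 0.0], [1.0, 0.0]])
good = np.array([[0.9, 0.1], [0.8, 0.2]])  # close to the target
bad = np.array([[0.2, 0.9], [0.1, 0.8]])   # nearly inverted
```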

The non-linearity used between convolutions is ReLU and regularization is done using batch normalization.

The models are trained for 100 epochs each.

Post-processing and model evaluation

With the above systems in place, we were successful in training models which could predict segmentation masks.

Prediction introduces a threshold hyper-parameter which is used to binarize the output mask.

The models' raw output is an image with each pixel value in the range [0, 1], but a binary mask is preferred for applying to the input image.

All pixel values below the prediction threshold are set to 0 and the rest are set to 1.
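In NumPy terms the thresholding step amounts to a single comparison, sketched here with a hypothetical `binarize` helper:

```python
import numpy as np

def binarize(probabilities, threshold=0.5):
    """Set values below the threshold to 0 and the rest to 1."""
    return (probabilities >= threshold).astype(np.uint8)

probs = np.array([[0.10, 0.49], [0.50, 0.93]])
binary = binarize(probs)
```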

When the binarized mask was applied to the input image, it produced rough edges and insignificantly small holes in the clouds, so we apply a 5×5 Gaussian filter to smooth out the edges and fill the small gaps.
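One way such a smoothing step could look is sketched below. The kernel width `sigma=1.0` and the decision to binarize again at 0.5 after blurring are assumptions; the post only states that a 5×5 Gaussian filter is used.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Return a normalised size x size Gaussian kernel."""
    offsets = np.arange(size) - size // 2
    one_d = np.exp(-(offsets ** 2) / (2 * sigma ** 2))
    kernel = np.outer(one_d, one_d)
    return kernel / kernel.sum()

def smooth_mask(mask, size=5, sigma=1.0, threshold=0.5):
    """Blur a binary mask with a Gaussian and binarize it again."""
    kernel = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(mask.astype(float), pad, mode="edge")
    blurred = np.zeros(mask.shape, dtype=float)
    # Accumulate shifted copies weighted by the kernel (a small convolution).
    for dy in range(size):
        for dx in range(size):
            window = padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
            blurred += kernel[dy, dx] * window
    return (blurred >= threshold).astype(np.uint8)

mask = np.ones((9, 9), dtype=np.uint8)
mask[4, 4] = 0  # a one-pixel hole inside a cloud
smoothed = smooth_mask(mask)
```

With these settings a one-pixel hole is filled, and by the same logic an isolated one-pixel speck is removed.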

Inspiration for Test Time Augmentation (Source [2]).

At this stage, the results suffered from an unexpected effect.

The predictions had visible blocks due to the splitting and merging of the large image.

A Kaggle team [2] resolved these ‘local boundary effects’ by shifting the boundary pixels to the centre with a fixed offset, making a second prediction and using the mean of the two.

A second prediction increases execution cost but resolved the block effect and also lowered variance as a side effect.
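The idea can be sketched as two tilings of the same image, the second shifted by a fixed offset, whose predictions are averaged wherever they overlap. The tile size, the offset of half a tile, and the `toy_model` stand-in are all assumptions for illustration; the real network would take the place of `predict_fn`.

```python
import numpy as np

def predict_tiled(image, predict_fn, tile, start=0):
    """Predict tile by tile on a grid that begins at (start, start).

    Returns the summed predictions and a per-pixel count of how many
    predictions cover each pixel.
    """
    height, width = image.shape[:2]
    total = np.zeros((height, width), dtype=float)
    count = np.zeros((height, width), dtype=float)
    for top in range(start, height, tile):
        for left in range(start, width, tile):
            rows = slice(top, min(top + tile, height))
            cols = slice(left, min(left + tile, width))
            total[rows, cols] += predict_fn(image[rows, cols])
            count[rows, cols] += 1
    return total, count

def predict_with_offset(image, predict_fn, tile=32, offset=16):
    """Average a regular tiling with a second tiling shifted by offset."""
    total_a, count_a = predict_tiled(image, predict_fn, tile, start=0)
    total_b, count_b = predict_tiled(image, predict_fn, tile, start=offset)
    # The unshifted pass covers every pixel, so the count is never zero.
    return (total_a + total_b) / (count_a + count_b)

def toy_model(patch):
    # Hypothetical stand-in for the network: per-tile mean brightness.
    return np.full(patch.shape[:2], patch.mean())

image = np.zeros((64, 64))
image[:, 32:] = 1.0
merged = predict_with_offset(image, toy_model)
```

In the toy example the first pass yields a hard step at column 32, while the merged output moves through intermediate values (0.25, then 0.75) on either side of it, which is the block-softening effect described above.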

Using the mean prediction smooths the confidence of the predicted output.

To further reduce the variance in the output, we replicate the data augmentation method at prediction time.

We predict on rotated and mirrored copies of the input and average the predictions for each pixel.
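A sketch of this test-time augmentation follows: each view is predicted, the transform is undone so the result lines up with the original pixels, and the eight predictions are averaged. `predict_tta` and `toy_model` are hypothetical names, and the toy model simply adds a position-dependent bias to show the averaging effect.

```python
import numpy as np

def predict_tta(image, predict_fn):
    """Average predictions over the 8 rotated/mirrored views of image."""
    outputs = []
    for k in range(4):
        for mirror in (False, True):
            view = np.rot90(image, k, axes=(0, 1))
            if mirror:
                view = np.flip(view, axis=1)
            pred = predict_fn(view)
            # Undo the transform so the prediction lines up with the input.
            if mirror:
                pred = np.flip(pred, axis=1)
            pred = np.rot90(pred, -k, axes=(0, 1))
            outputs.append(pred)
    return np.mean(outputs, axis=0)

def toy_model(patch):
    # Hypothetical stand-in for the network: adds a bias that grows from
    # left to right, so each view gives a slightly different answer.
    bias = np.linspace(0.0, 0.2, patch.shape[1])
    return patch + bias

image = np.random.default_rng(0).random((8, 8))
averaged = predict_tta(image, toy_model)
```

In this toy case the position-dependent bias averages out to a constant 0.1 at every pixel, which illustrates how averaging over views reduces variance in the output.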

Hyper-parameter Tuning

The goal of hyper-parameter tuning is to find the combination of settings which maximizes the Jaccard Score.

This was done using grid search (use random search [3] instead) over the following:

Input Image Size: 16×16, 32×32, 64×64, 128×128

Loss: BCE, Lovasz, Iglovikov

Prediction Threshold: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9

Learning Rate: 0.1, 0.01, 0.001

The evaluation scores are of the validation set, which is data unseen by the model.
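The size of this search space can be enumerated with `itertools.product`, as in the sketch below. The dictionary keys are hypothetical names. Since the prediction threshold only matters after training, it is assumed here that one trained model can be scored at all nine thresholds, which would leave 36 distinct training runs.

```python
from itertools import product

grid = {
    "input_size": [16, 32, 64, 128],
    "loss": ["BCE", "Lovasz", "Iglovikov"],
    "threshold": [round(0.1 * i, 1) for i in range(1, 10)],
    "learning_rate": [0.1, 0.01, 0.001],
}

# Every combination of settings that gets a validation score.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

# Settings that require training a separate model (threshold excluded).
training_keys = ["input_size", "loss", "learning_rate"]
training_runs = list(product(*(grid[key] for key in training_keys)))

print(len(combinations), len(training_runs))
```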

Influence of Input Image Size at different thresholds for models trained with the different loss functions on Jaccard score.

From the above graphs we observe that Input Image Size 32×32 outperforms the other sizes on average (consistently on Lovasz Loss).

The curves peak at prediction threshold values 0.5 or 0.


To have a closer look at the influence of loss function we fixed the input size to 32 and observed the model performance.

The learning rate was set to 0.01 with Adam as the optimization algorithm.

We reiterate that the hyper-parameters were chosen to maximize robustness to boost generalization of the model.

Putting together the components described above:

A high-level processing pipeline.

Results

All models were trained with hyper-parameters on 5 NVIDIA GK110GL Tesla K20m GPUs (11 days in total).

The following scores were obtained on the test dataset:

╔════════════════╦══════════╦══════════╦═══════════╗
║                ║ BCE      ║ Lovasz   ║ Iglovikov ║
╠════════════════╬══════════╬══════════╬═══════════╣
║ Test Score     ║ 0.9786   ║ 0.9786   ║ 0.9734    ║
║ Threshold      ║ 0.5      ║ 0.6      ║ 0.5       ║
╚════════════════╩══════════╩══════════╩═══════════╝

Not all settings performed well but there were a few with high test scores, allowing us a few options to choose from.

We visually observe an example from the model trained on BCE loss:

Input (Top Left), Ground Truth (Top Right), Masked Input (Bottom Left) and Predicted Mask (Bottom Right) Images.

This is another example prediction visualized for different values of the prediction threshold:

Ground Truth and Predicted Masks for different threshold values.

We were able to improve results for specific cases but in general our original choice of 0.5 generalized best.

We attempted to trick the model with an input image with no clouds:

Input, Ground Truth and Masked Input for a satellite image with no clouds.

The model correctly predicted that there were no clouds in the input image by generating a (nearly) blank mask.

Taking a closer look at our first example we see that the model overrides some areas of the ground truth.

The blobs of the mask at the centre of the image are not found in the predicted mask but in their place we find smaller chunks corresponding to the small clouds clearly visible in the input image.

Visual comparison of a) input and masked input images (left) / b) ground truth and predicted mask (Right).

There are a few such examples which indicate that the model has understood what clouds really look like and does not simply follow the provided ground truth.

These differences bring down the score but at this point, we agree with our model’s prediction over the ground truth — at least for this instance.

From the masked input image, we can see that the model was able to remove all clouds.

Although the learning is in the right direction we believe there is scope for improvement in terms of smoothness and generalization.

Additional fine-tuning and parameter testing can further improve the model performance.

More importantly, we need a more robust evaluation method than the Jaccard score and manual visual inspection — this is a common problem in all image comparison tasks.
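For reference, the Jaccard score of two binary masks is the size of their intersection divided by the size of their union. The sketch below is one straightforward way to compute it; treating two empty masks as a perfect match is an assumption, and the post does not say how scores were aggregated over images.

```python
import numpy as np

def jaccard_score(pred_mask, true_mask):
    """Intersection over union of two binary masks."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match here
    return float(np.logical_and(pred, true).sum() / union)

true_mask = np.array([[1, 1, 0], [0, 0, 0]])
pred_mask = np.array([[1, 0, 0], [1, 0, 0]])
score = jaccard_score(pred_mask, true_mask)  # 1 shared pixel, 3 in the union
```

A score of this kind cannot tell a wrong prediction from a wrong annotation, which is one reason the cases where the model disagrees with the ground truth lower the score.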

Without any modifications to this solution, we trained a model to detect plots in satellite images.

The details of that experiment are described in:

Reusing Nimbus for Plot Detection. Use deep learning to identify the boundaries of farms in satellite images.

Future Work

This project served as a proof of concept that Deep Learning has a place in Agriculture.

We have merely scratched the surface, and the work is far from complete.

Listed below are some immediately visible future steps:

As in all Deep Learning solutions, the more (diverse) data the model learns from, the better. We trained on images captured under different weather conditions throughout the year. Training with images from more diverse conditions (in day and night) and from other regions will help build a more robust solution for cloud segmentation.

The input data used only includes the RGB-NIR channels; newer satellite images have additional image channels which may capture information useful for making better predictions.

We did not use traditional methods [4] from the field of geo-information; using indexes such as NDVI and EVI may help achieve faster convergence.

The images used were at 10m/pixel resolution; there are higher-resolution images which provide finer details and are ideal for Plot Detection.
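As an illustration of the index-based idea mentioned above, NDVI can be computed directly from the red and near infrared channels that are already part of the input, using the standard definition (NIR - Red) / (NIR + Red). Feeding it to the network as an additional channel is a suggestion of this sketch and was not tested here; the `eps` guard is likewise an assumption.

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """Normalised Difference Vegetation Index, roughly in [-1, 1]."""
    nir = nir.astype(float)
    red = red.astype(float)
    # eps avoids a division by zero on pixels where both bands are 0.
    return (nir - red) / (nir + red + eps)

nir = np.array([0.5, 0.2])
red = np.array([0.1, 0.2])
values = ndvi(nir, red)  # high for vegetation, near 0 when the bands match
```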

References

[1] Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention 2015 Oct 5 (pp. …). Springer, Cham.

[2] Dstl Satellite Imagery Competition, 3rd Place Winners’ Interview: Vladimir & Sergey (http://blog.…com/2017/05/09/dstl-satellite-imagery-competition-3rd-place-winners-interview-vladimir-sergey/)

[3] Bergstra J, Bengio Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research.

[4] Remote Sensing Indices, https://www.…php

Thank you to Prof. Zeynep Akata (University of Amsterdam), Rob van der Zanden (DLL Eindhoven) and Gerbert Roerink (Wageningen University and Research) for enabling the project.

