Clean Up your own Model Data without leaving Jupyter

Using the new Innotater tool to annotate data for a better predictive model.

Dan Lester, Apr 25

Many machine learning projects start with a Jupyter notebook, and the first few lines of the notebook load the training data.

Beyond a quick sanity check that the data looks about right, it can be disruptive to the flow of your programming if you need to step back out of the notebook to clean up or annotate your data.

This article introduces a new open source tool, Innotater, that provides an interactive Jupyter widget allowing a developer to annotate their data directly inline within the notebook.

Having an easy way to clean and augment data can quickly lead to better predictive models.

We will use this approach in a computer vision task: first, we manually filter out images that should never have made it into our dataset in the first place, improving a simple butterfly classifier.

We then use Innotater to draw butterfly bounding boxes on a subset of our images.

This allows us to train a bounding box prediction model that we run on every image in our dataset so we can generate cropped, zoomed-in versions of the originals.

Then we go on to train a new classifier with an improved accuracy since the model doesn’t need to consider as much irrelevant background imagery.

Innotater widget embedded in a Jupyter Notebook

This article is aimed at readers with some existing understanding of computer vision models in Jupyter notebooks who may be interested in the tools available for annotating images, and in the ‘trick’ of improving the model by manually drawing bounding boxes and training a second model to zoom in on them.

All the code is available in Jupyter Notebooks on GitHub here so you can follow along.

It is written in Python and uses fast.ai, a framework that sits on top of PyTorch. fast.ai keeps things at a pretty high level and provides best-practice defaults, so the deep learning code is light.

Obtaining the Raw Data

The task of building a classifier to identify two different butterfly species is borrowed from a previous Towards Data Science article by Bert Caramans.

He wrote a script to download photos from Flickr that are tagged either “Gatekeeper” or “Meadow Brown” — two butterfly species that are notoriously confused when volunteers attempt to count butterfly populations in the wild for conservation purposes.

Our first notebook downloads the images from Flickr.

Instead of the usual ‘one folder per class’ file storage model, we actually place all images in one folder and build a CSV file listing the supposed class (based on Flickr tag) of each image.

The reasons for doing things this way are so that we can extend the CSV to record our manually-drawn bounding boxes, and also easily change the class of the image if we find that the incorrect species has been identified.

We can do these things within the Innotater and then easily save the modified classes and bounding boxes back to the CSV.

Filtering and Annotating Data

The Innotater tool is designed to be a quick and easy way to step through your images and mark important facts or augment data on each image.

There are a few things we want to note about each image:

1. Is the classification correct? If the wrong species label has been assigned to the image, we want to change it easily.

2. Does this image belong in our dataset in the first place? In some cases, albums have been tagged with the butterfly species but not all the images are really of butterflies at all, so we want to remove those from our dataset.

3. Bounding boxes around the butterfly. We won’t take the trouble to do this for all images, but at least for some images we will draw a tight bounding box around the butterfly.

Bounding boxes will allow us to build a more accurate model at a later stage, although to build our first simple classifier we only actually need data from the first two points above.

The second notebook is used to perform these steps.

Please note that you can’t see the Innotater widget in GitHub previews, so hopefully the screenshot below will bring it to life.

We use Pandas to read in the CSV and extract three NumPy matrices that will be fed into the Innotater.

classes is a NumPy array of 0’s and 1’s specifying whether each image in the dataset is (currently) the Gatekeeper or Meadow Brown species.

excludes is also an array of 0’s or 1’s.

It starts off as all 0’s but we may turn an entry to 1 in order to exclude the corresponding image from our dataset going forward.

bboxes is a four-column matrix containing x,y,w,h of our bounding box for each image, where (x,y) is the top left corner of the box, w is the width, and h is the height.

These all start as 0’s until we draw any boxes manually.

Each row of the matrices above corresponds to the ‘filename’ column in the Pandas DataFrame which we loaded from the CSV.
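As a sketch of that setup, the three matrices can be pulled out of the DataFrame roughly like this (a tiny stand-in DataFrame is built inline here; the real notebook reads the full CSV from disk, and the exact extraction code may differ):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real CSV; column names match those used later
# in the notebook (filename, class, exclude, x, y, w, h).
df = pd.DataFrame({
    'filename': ['img001.jpg', 'img002.jpg'],
    'class':    ['Meadow Brown', 'Gatekeeper'],
    'exclude':  [0, 0],
    'x': [0, 0], 'y': [0, 0], 'w': [0, 0], 'h': [0, 0],
})

cats = sorted(df['class'].unique())                       # class names
classes = np.array([cats.index(c) for c in df['class']])  # 0 or 1 per image
excludes = df['exclude'].to_numpy()                       # all 0 until we exclude images
bboxes = df[['x', 'y', 'w', 'h']].to_numpy()              # all 0 until boxes are drawn
```

Each row of classes, excludes, and bboxes lines up with the same row of df, which is what lets us write the edited values straight back later.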

We’re nearly ready to invoke the Innotater (at home you’ll need to pip install jupyter_innotater too!), but first we need to think about the order in which we step through the images.

Mixing Things Up

Due to the way the CSV was created, we have nearly 500 images of Meadow Brown butterflies at the start of the file, and then nearly 500 Gatekeeper butterflies in the second half.

Since we’re not expecting to draw bounding boxes on every single image — maybe just 200 or so — stepping through in the default order causes a problem.

If we draw bounding boxes on each image we see until we’ve annotated enough then we’ll only have bounding boxes on a subset of the first butterfly species.

Gatekeeper butterflies won’t have any!

So in cell number 7 we use a bit of Python/NumPy manipulation to create a new mapping called indexes which specifies a new ordering based on index numbers.
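One way to build such an interleaved ordering looks like this (a sketch on a hypothetical six-image dataset; the notebook’s exact NumPy manipulation may differ):

```python
import numpy as np

# Hypothetical small dataset: first half Meadow Brown, second half Gatekeeper
n = 6
half = n // 2

# Pair up the i-th image of each species, then flatten, giving
# first Meadow Brown, first Gatekeeper, second Meadow Brown, ...
indexes = np.ravel(np.column_stack((np.arange(half), np.arange(half, n))))
# indexes is now [0, 3, 1, 4, 2, 5]
```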

The new ordering shows the first Meadow Brown, then the first Gatekeeper, then the second Meadow Brown, and so on…

Invoke the Innotater!

In order to view and edit all the important aspects of the dataset, this is how we cause the user interface of the Innotater to appear. The syntax for launching the Innotater widget is designed to be simple and flexible.

The format is Innotater(inputs, targets, indexes=indexes) where inputs and targets are arrays (or just single items) of special Innotation objects which are essentially wrappers around your dataset’s matrix representations.

Generally, inputs wrap data on the ‘x’ side of your data science problem and don’t expect to be altered; targets are the ‘y’ side and may need changing — for example, changing the classification or entering bounding box data.

Innotation classes are flexible in terms of the data format you provide; you just need to make sure you pick the right subclass of Innotation for the type of data.

For example, the images themselves (the ‘x’ side of our machine learning task) just need to be wrapped in an ImageInnotation object like this: ImageInnotation(filenames, path='.').

The path argument is optional if your filenames are already absolute or relative to the working folder, and in fact you don’t need to supply filenames at all: you can provide already-loaded matrices, perhaps imported using cv2.imread from OpenCV.

On the targets side, we use BinaryClassInnotation(excludes) to represent the 0’s and 1’s of the excludes array as a checkbox next to each image.

The excludes variable isn’t really on the ‘y’ side of our problem, but we want to be able to edit it, and we’ll use it to filter out images where excludes==1 going forward.

Genuine ‘y’ side targets include the classification of butterflies, turned into a listbox component through MultiClassInnotation(classes, classes=cats).

Note we could have used BinaryClassInnotation here again since we only have two classes (0 or 1), but a checkbox doesn’t feel right to switch between two different species (‘check the box for Gatekeeper, uncheck for the other species’), and a listbox approach scales if we want to add more species in future.

The classes variable itself can be in many forms: a simple Python list of 0’s and 1’s, a NumPy column vector, or a two-dimensional one-hot encoding of the data.

The Innotater inspects your data and works with it accordingly.
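The kind of inspection described above can be sketched like this (an illustration only, not the widget’s actual internal code):

```python
import numpy as np

def to_class_indices(classes):
    """Normalize a flat list, a column vector, or one-hot rows to a
    1-D array of class indices. Illustrative only; the Innotater's
    own inspection logic may differ."""
    a = np.asarray(classes)
    if a.ndim == 2:
        if a.shape[1] == 1:
            return a.ravel()        # column vector
        return a.argmax(axis=1)     # one-hot rows
    return a                        # already a flat list
```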

The most interesting Innotation class we’re going to use is perhaps BoundingBoxInnotation(bboxes) which initially displays as a single text box where we can enter each box’s (x, y, w, h) shape as a comma-separated list of numbers.

Even better, it automatically connects to the ImageInnotation we provided in inputs so that we can draw the box over the image itself and have our bounding box co-ordinates set automatically to represent the shape we’ve drawn!

The full code to instantiate the widget is:

Innotater(
    ImageInnotation(df['filename'], path=IMAGE_FOLDER, height=300, width=400),
    [
        BoundingBoxInnotation(bboxes),
        BinaryClassInnotation(excludes, name='Exclude'),
        MultiClassInnotation(classes, classes=cats, dropdown=True)
    ],
    indexes=indexes
)

Using the Next/Prev buttons you can step through each image and draw boxes, change classes, or check the ‘excludes’ checkbox.

As you do so, the underlying Python variables bboxes, classes, and excludes will update instantly.

So at any point in our notebook we can visit cell 12 below the widget, set the updated variables back into the Pandas DataFrame (the variable called df) and write the CSV file to disk:

df[['x','y','w','h']] = bboxes
df['exclude'] = excludes
df['class'] = [cats[i] for i in classes]

# And save the full Pandas data back to a CSV file
df.to_csv(BUTTERFLIES_BBOXES_FILEPATH, index=False)

The way the notebook is set up means we can come back in a different notebook session, load the latest CSV values in, and continue annotating.

As long as you explicitly save the CSV in each session you don’t have to annotate all the data in one sitting.

Catching Butterflies

Now we’ve inspected and annotated our data, let’s do something with it! There are three notebooks involved in this section, numbered 3 to 5 in the butterflies GitHub repo.

Basic Training

First of all, in 3 – Basic Train.ipynb, after eliminating any images marked as ‘exclude’ by ourselves in the Innotater, we just train a basic classifier model.

This is the ‘cats or dogs’ of canonical machine learning tutorials.

The fast.ai framework does so much of this for us that there really is very little machine learning code here. Most code is boilerplate borrowed from fast.ai examples.

The code is commented to explain what’s happening: loading the CSV into a Pandas DataFrame, using that to make a ‘DataBunch’ object containing train and test datasets, then using that object to provide training data to a pre-trained ResNet50 model.

It uses an Adam optimiser to train with most existing layers frozen for 10 epochs; then the model is ‘unfrozen’ so that all layers can be fine-tuned for a further 5 epochs.

Choosing our validation set is something that needed some thought in this project.

Reserving 20% of our dataset for validation purposes seems sensible, and there is a fast.ai function to do so by random selection.

But this leads to ‘data leakage’ — images in the validation set might be very similar to images in the training set, allowing the model to ‘cheat’ by clinging to irrelevant artefacts of those images.

This happens because images from the same Flickr album will typically sit sequentially in the DataFrame.

So a safer approach is to take the first 80% of each class’ images for training and leave the remaining 20% for validation.

This way, in the worst case we only split one album across train and validation sets.
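The split described above can be sketched as follows (split_indices_per_class is a hypothetical helper, not the notebook’s exact code):

```python
import numpy as np

def split_indices_per_class(labels, valid_frac=0.2):
    """For each class, send the first (1 - valid_frac) of its rows to
    training and the remainder to validation, preserving file order so
    a Flickr album is rarely split across the two sets."""
    labels = np.asarray(labels)
    train, valid = [], []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]    # rows of this class, in file order
        cut = int(len(idx) * (1 - valid_frac))
        train.extend(idx[:cut].tolist())
        valid.extend(idx[cut:].tolist())
    return sorted(train), sorted(valid)
```

For example, with ten images per class the last two of each class land in the validation set, rather than a random 20% scattered through the albums.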

Training results in the basic model: accuracy at end of 5th epoch is 0.82

With the basic model we end up with an accuracy of 82%.

Not too bad!

Bounding Box Model

The whole intention of this project was to see if drawing our own bounding boxes could help us build a better model.

The theory was that building a model to predict tight bounding boxes, showing exactly where the butterfly features in the image, would mean we can crop and zoom into the butterfly itself and hopefully train a classifier explicitly on the zoomed images.

For a new unseen butterfly photo, we would run our classification process in two stages: first, predict the bounding box so we can zoom in on the butterfly; secondly, run our classifier on the zoomed image.

These two stages are developed in the final two notebooks, numbered 4 and 5 in GitHub.

The first part of notebook 4 – BBox Train and Generate.ipynb works very similarly to notebook 3 in that it uses similar infrastructure to train a model.

In this case, we’re predicting bounding boxes instead of just ‘0 or 1’ classification, so it’s a bit more involved.

To build the model, we first remove any images where we didn’t get round to drawing a bounding box — remember we never intended to annotate all images.

We also have to write our own fast.ai classes to handle bounding boxes (fast.ai’s own infrastructure for this wasn’t quite ready at the time of writing).

Optimisation is similar, but we use the L1 loss measurement (sum of absolute horizontal and vertical distances between target and predicted co-ordinates) in order to see how well the model is performing against our manual drawings.
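In NumPy terms the measurement looks something like this (the notebook uses a PyTorch loss during training; this is just an illustration of the L1 idea over the four box numbers):

```python
import numpy as np

def bbox_l1(pred, target):
    """Mean L1 distance between predicted and target (x, y, w, h) rows:
    the sum of absolute coordinate differences, averaged over images."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.abs(pred - target).sum(axis=1).mean()
```

A perfect prediction scores 0; each pixel the box drifts in any co-ordinate adds to the loss.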

The notebook shows a few different attempts to try to get better bounding box predictions — it’s a bit messier than the previous notebook — but in any case, by the end we have some reasonable-looking bounding boxes.

We could do a lot better, and the boxes often cut off important butterfly markings that will probably be meaningful to the classifier in the next stage! Anyway, you can certainly try to improve on this, but let’s keep going…

The last cell in notebook 4 applies the model to all images in the CSV (except those with excludes marked as 1) in order to output our cropped and zoomed images.

At training time we could only make use of those images where a bounding box was present, but now we’ve trained the model we can apply it to every single image to get a full set of bounding box predictions.

The code runs through each image and generates a ‘zoomed’ version of the image based on those bounding box co-ordinates — each new image hopefully containing a nice big centred butterfly.
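The crop step itself amounts to array slicing, something like this minimal sketch (the pad margin is an assumption of mine, not taken from the notebook, and the real code also resizes the result):

```python
import numpy as np

def zoom_crop(img, bbox, pad=0.05):
    """Crop an H x W x C image array to a predicted (x, y, w, h) box,
    with a small relative margin so markings at the box edge survive.
    The pad value is illustrative, not from the notebook."""
    x, y, w, h = bbox
    H, W = img.shape[:2]
    px, py = int(w * pad), int(h * pad)
    x0, y0 = max(0, int(x) - px), max(0, int(y) - py)
    x1, y1 = min(W, int(x + w) + px), min(H, int(y + h) + py)
    return img[y0:y1, x0:x1]
```

The cropped array would then be resized to the 256-pixel square expected by the classifier, for example with OpenCV or PIL.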

Our original set of images would have some butterflies featuring in a relatively small section of the overall image photo.

This introduces a lot of noise in the borders, and since our images are resized to 256 pixels square in the preprocessing of our ‘basic train model’, a lot more of the butterfly itself should find its way into the neural network’s layers if we train again on our zoomed images.

Zoomed and Cropped Training

Since we already performed all the zooming and cropping for all images at the end of notebook 4, you’ll find that notebook 5 – Zoomed Cropped Train.ipynb is pretty much an exact copy of notebook 3, except it runs on the new zoomed images (which were saved into a ‘zoomed’ subfolder).

It seems only fair to use the same training steps that we used when we trained the basic model before: we want to be able to compare the models to see if the model trained on zoomed images performs better.

By the end of training, we see an accuracy of 84% (up from 82% in the previous version).

That’s definitely going in the right direction!

Conclusion

In truth, you could do a much better job training all of these models — my aim was never to teach you to successfully train computer vision neural networks.

It’s entirely possible that a single better-structured neural network could emulate some of the ‘zooming’ that takes place in our combined model.

The input data themselves are shaky since, for a relatively small dataset, we can only hope that Flickr users all take a consistent approach to taking photos then tagging and uploading them.

But I hope this project shows that the Innotater is a fun way to get your hands dirty with the data, and not only clean up your dataset but also encapsulate ‘human insight’ manually that might not otherwise find its way into your modelling process.

Ultimately, in this example we are already relying on humans to label the butterfly species manually, so why not take it a bit further yourself and teach your model what a butterfly looks like in the first place?

At the time of writing, images, single bounding boxes, and the listbox/checkbox controls are the only available wrappers for Innotater data.

The ways in which they can be combined are already very flexible, but of course you might need other annotation types (maybe multiple bounding boxes and different shapes) depending on your project.

Please do get in touch describing the problems you’re facing with your data, or limitations in trying to use the Innotater, and further development can incorporate your ideas and solutions! Further details can be found on the Innotater GitHub page.

