Mario vs. Wario — round 2: CNNs in PyTorch and Google Colab

Eryk Lewinson, Jan 27

For quite some time I had been meaning to play with Google Colab (yes, free access to a GPU…).

I think this is a really awesome initiative, which enables people with no GPU on their personal computers to play around with Deep Learning and train models they would not be able to train otherwise.

Basically we have a 12h window to play around and then the VM dies.

But we can of course start up a new one and there are ways to continue your work from previous sessions.

In this article I would like to present an extension of my previous work.

This time however, I will build a CNN using PyTorch and train it on Google Colab.

In the end I hope to achieve better results than previously! Let’s start :)

1. Setting up Google Colab

There are a few good articles already on Medium regarding how to start your adventure with Google Colab, how to enable GPU etc.

I wanted to show a few useful commands for inspecting what kind of hardware and software we are actually working on.

We see that we are working on a Tesla K80 and that CUDA 9.2 is already installed.
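As a minimal sketch of such an inspection, assuming a Colab-like Linux VM, the snippet below gathers similar information from Python. `describe_environment` is a hypothetical helper name (not from the notebook), and it only queries `nvidia-smi` when that tool is actually on the PATH.

```python
import platform
import shutil
import subprocess
import sys


def describe_environment():
    """Collect basic facts about the machine this notebook runs on."""
    info = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "gpu": None,  # stays None when no NVIDIA tooling is found
    }
    # nvidia-smi ships with the NVIDIA driver; it is absent on CPU-only VMs.
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            info["gpu"] = result.stdout.strip() or None
    return info


print(describe_environment())
```

In a notebook cell, running `!nvidia-smi` directly prints the same details in a more verbose table.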

This makes things much easier!

It was not that easy to find out how to work efficiently with larger datasets stored on Google Drive.

Lots of courses and posts use built-in datasets from either PyTorch or other libraries.

But at first I found it a bit tricky to work with my own set of images.

So I did the following:

Upload the datasets (zipped files with train/test folders) to Google Drive. This can be easily done via the Drive UI. The initial directory tree looked something like this:

    mario_vs_wario/
        training_set/
            mario/
                mario_1.jpg
                mario_2.jpg
                ...
            wario/
                wario_1.jpg
                wario_2.jpg
                ...
        test_set/
            mario/
                mario_1.jpg
                mario_2.jpg
                ...
            wario/
                wario_1.jpg
                wario_2.jpg
                ...

Mount Google Drive. When using Colab it is important to store files in a Colab directory, not on the mounted Google Drive.

The cell below contains code for connecting to Google Drive and mounting the drive, so that we can access all files stored there.

However, training Neural Networks (even with GPU enabled) with data loaded from Google Drive will most of the time be significantly slower than training locally on a CPU.

That is due to copying all the data between Colab and Drive directories, which is incredibly slow.

Move the zip files from my Google Drive (via shareable link) to directories created in the Colab environment and unzip. To tackle the above-mentioned problem, I zip the training and test sets separately and download the files by using gdown and the link to Google Drive (the shareable link you get from Drive’s UI).

Then I unpack the folders with images to the designated directories.

In the last step I remove a leftover directory.
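The unpacking step can be sketched with the standard library alone. Mounting and downloading are left as comments because they only work inside Colab and need a real Drive link. `unpack_dataset`, the paths, and the `__MACOSX` leftover folder name are assumptions for illustration (zip files created on macOS often contain such a folder), not details taken from the notebook.

```python
import shutil
import zipfile
from pathlib import Path


def unpack_dataset(zip_path, target_dir, leftover="__MACOSX"):
    """Extract zip_path into target_dir and delete a leftover folder if present."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Remove the junk directory, if the archive contained one.
    shutil.rmtree(target / leftover, ignore_errors=True)
    # Return the extracted files as sorted relative paths for a quick sanity check.
    return sorted(
        p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file()
    )


# Inside Colab, Drive can be mounted with:
#   from google.colab import drive
#   drive.mount("/content/drive")
# and an archive can be downloaded into the VM with gdown
# (<FILE_ID> is a placeholder, not a real link):
#   !gdown "https://drive.google.com/uc?id=<FILE_ID>" -O training_set.zip
# files = unpack_dataset("training_set.zip", "data/training_set")
```

Keeping the extracted images on the VM's local disk is what avoids the slow per-file traffic between Colab and Drive during training.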


2. Loading Data

In this part I load and pre-process the data (images). I will describe the process step by step:

First, I define some parameters and the transformations I want to carry out on the images (resize to 128×128, convert to tensors and normalize).

This is also the step where I could carry out image augmentation (random cropping, shearing, rotations etc.).


However, as this particular problem is about classifying video games’ images, I think it does not make sense to apply these transformations, as the images will no longer resemble the original screenshots.

But if you are building a cat/dog classifier and do not have a really large dataset (and even if you do), this would be the place to apply the transformations.

I specify the directory for train/test data and apply the selected transformations.

I randomly select a subset of indices from the training set to use them for validation.

I also create SubsetRandomSamplers for sampling images from given indices (not the entire dataset).

I create DataLoaders by combining datasets with samplers.

I use pin_memory = True in case of training on a GPU (recommended setting).

For the test_loader I also shuffle the dataset, otherwise it would first take all observations from one class, then all from the second one, without any reshuffling.

In case of the test set, this actually does not matter.

But it is good to be aware of this functionality.
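The splitting and loading steps above can be sketched as follows. To keep the example self-contained, small random `TensorDataset` objects stand in for the image folders (in a real run the datasets would be built from the image directories with the transformations applied, for instance via torchvision's `ImageFolder`). The batch size, the 20% validation share, and all variable names are assumptions.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

torch.manual_seed(0)

# Stand-in data: fake RGB "images" (kept tiny for speed) with binary labels.
train_data = TensorDataset(torch.randn(100, 3, 16, 16), torch.randint(0, 2, (100,)))
test_data = TensorDataset(torch.randn(20, 3, 16, 16), torch.randint(0, 2, (20,)))

batch_size = 10     # assumed value
valid_share = 0.2   # assumed share of training images held out for validation
pin = torch.cuda.is_available()  # pinned memory only helps when a GPU is present

# Shuffle all training indices, then cut off the first 20% for validation.
indices = torch.randperm(len(train_data)).tolist()
split = int(valid_share * len(train_data))
valid_idx, train_idx = indices[:split], indices[split:]

# Each sampler draws, in random order, only from its own indices.
train_loader = DataLoader(train_data, batch_size=batch_size,
                          sampler=SubsetRandomSampler(train_idx), pin_memory=pin)
valid_loader = DataLoader(train_data, batch_size=batch_size,
                          sampler=SubsetRandomSampler(valid_idx), pin_memory=pin)
# No sampler for the test set, so shuffle=True is what mixes the classes.
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=True, pin_memory=pin)

print(len(train_idx), len(valid_idx), len(test_loader))
```

Note that the training and validation loaders share the same underlying dataset; only the samplers keep the two sets of images apart.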

In the code below I inspect 10 randomly selected images.

As DataLoaders work as iterators, I first use iter() followed by next() to obtain randomly selected images and their labels (from the first batch).
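A minimal sketch of pulling one batch out of a loader with `iter()` and `next()`. The random tensors stand in for real screenshots, and the batch size of 10 is chosen only to match the ten images mentioned above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 40 fake 128x128 RGB images with binary labels.
data = TensorDataset(torch.randn(40, 3, 128, 128), torch.randint(0, 2, (40,)))
loader = DataLoader(data, batch_size=10, shuffle=True)

# iter() creates a fresh iterator over batches; next() returns the first one.
images, labels = next(iter(loader))
print(images.shape, labels.shape)
```

Because the loader shuffles, every call to `next(iter(loader))` starts a new pass and returns a different random batch.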


3. CNN architecture

I present two approaches to defining the architecture of Neural Networks. The first one is by building a class which inherits from nn.Module.

The second one is more similar to Keras: we create a sequence of layers.

There is no right or wrong here, it all depends on personal preferences.

In both approaches I use the same architecture, so only one should be used before training.



3.1. Class Approach

I define a class inheriting from nn.Module, which combined with super().__init__() creates a class that tracks the architecture of the neural network and provides a variety of methods and attributes. It is important to note that the class must inherit from nn.Module.

The class must contain two methods: __init__ and forward.

I give a little more explanation on each of the required methods:

__init__ – it is used to define the attributes of the class and populate specified values at initialization.

One rule is to always call the super() method in order to initialize the parent class.

Aside from this, we can define all the layers which have some parameters to be optimized (weights to be adjusted).

We do not need to define activation functions, such as relu here, because given the same input they will always return the same output.

The order of defined layers does not matter, as these are purely definitions, not the architecture specifying how the layers are connected.

forward – In this method we define the connections between layers.

We specify the order in which they are connected and ultimately return the output of the network.

On a side note, the variable does not have to be called x; what matters is that it passes through the layers in the correct order.
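The two methods can be illustrated with a small network. This is only a sketch: the class name `MarioWarioNet`, the layer sizes, the dropout rate, and the log-softmax output are assumptions and not the architecture from the notebook. The input size of 128×128 RGB images follows the resize step described earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MarioWarioNet(nn.Module):  # hypothetical name and layer sizes
    def __init__(self):
        super().__init__()  # initialise nn.Module so it can register the layers
        # Definitions only: the order here does not determine the data flow.
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(32 * 32 * 32, 64)
        self.fc2 = nn.Linear(64, 2)
        self.dropout = nn.Dropout(p=0.25)

    def forward(self, x):
        # The actual wiring: conv -> relu -> pool, twice, then two dense layers.
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # 128x128 -> 64x64
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # 64x64 -> 32x32
        x = x.view(x.size(0), -1)                   # flatten to (batch, 32*32*32)
        x = self.dropout(F.relu(self.fc1(x)))
        return F.log_softmax(self.fc2(x), dim=1)    # log-probabilities per class


model = MarioWarioNet()
out = model(torch.randn(4, 3, 128, 128))
print(out.shape)
```

The activation and pooling functions appear only in `forward`, since they carry no weights to optimise.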



3.2. Sequential Approach

The Sequential approach might be immediately familiar to those who have used Keras.

I create an OrderedDict with each layer specified in the order they are to be executed.

The reason for using OrderedDict is that I can give the layers meaningful names.

Without doing so, their names would be integers.

At the beginning I define a Flatten class, which basically reshapes the matrix into a long vector, as it is normally done in case of CNNs.

The OrderedDict is placed within nn.Sequential, which defines our model.
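A sketch of this approach, under the assumption of a small illustrative architecture (the layer names, sizes, and dropout rate are hypothetical, not taken from the notebook). The custom `Flatten` module is written out as described above; recent PyTorch versions also ship a built-in `nn.Flatten`.

```python
from collections import OrderedDict

import torch
import torch.nn as nn


class Flatten(nn.Module):
    """Reshape (batch, C, H, W) feature maps into (batch, C*H*W) vectors."""

    def forward(self, x):
        return x.view(x.size(0), -1)


# Layers run in the order listed; the keys become the layer names.
seq_model = nn.Sequential(OrderedDict([
    ("conv1", nn.Conv2d(3, 16, kernel_size=3, padding=1)),
    ("relu1", nn.ReLU()),
    ("pool1", nn.MaxPool2d(2)),       # 128x128 -> 64x64
    ("conv2", nn.Conv2d(16, 32, kernel_size=3, padding=1)),
    ("relu2", nn.ReLU()),
    ("pool2", nn.MaxPool2d(2)),       # 64x64 -> 32x32
    ("flatten", Flatten()),
    ("fc1", nn.Linear(32 * 32 * 32, 64)),
    ("relu3", nn.ReLU()),
    ("dropout", nn.Dropout(p=0.25)),
    ("fc2", nn.Linear(64, 2)),
    ("log_softmax", nn.LogSoftmax(dim=1)),
]))

print(seq_model.conv1)  # layers are reachable by their names
print(seq_model(torch.randn(4, 3, 128, 128)).shape)
```

Unlike in the class approach, activations and pooling have to be listed as modules here, because the list itself is the forward pass.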


4. Loss function and optimizer

The first step is to move the model to CUDA in case it will be trained on a GPU.

Then I specify the loss function for a binary classification problem and the optimizer as Stochastic Gradient Descent with a learning rate of 0.
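A hedged sketch of these two steps. The exact loss function and learning rate are not recoverable from the text, so `nn.NLLLoss` (which pairs with a log-softmax output) and `lr=0.01` are placeholder assumptions, and the tiny model is a stand-in for the CNN.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Tiny stand-in model so the snippet runs on its own.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2), nn.LogSoftmax(dim=1))

# Fall back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

criterion = nn.NLLLoss()  # assumed loss; expects log-probabilities as input
optimizer = optim.SGD(model.parameters(), lr=0.01)  # lr is a placeholder value

# One loss evaluation on fake data, just to show the pieces fit together.
images = torch.randn(4, 3, 8, 8, device=device)
labels = torch.tensor([0, 1, 1, 0], device=device)
loss = criterion(model(images), labels)
print(loss.item())
```

The inputs have to live on the same device as the model, which is why the tensors are created with `device=device`.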



5. Training the network

There is already a lot of material online about the steps one needs to take in order to train neural networks. I will only outline the steps:

Forward pass through the network (as specified in the forward() method)

Calculate the loss based on the network’s output

Backward pass through the network with loss.backward() to calculate the gradients

Update the weights by taking a step with the optimizer

There are also a few other things worth mentioning:

optimizer.zero_grad() – when doing multiple backward passes with the same parameters, the gradients accumulate. That is why we need to zero the gradients at the start of each training step.

model.eval() – while training we might be using dropout to prevent overfitting. However, for prediction/validation we want to use the entire network, thus we need to switch dropout off (set the dropout probability to 0) by using model.eval(). To go back to training mode we use model.train().

torch.no_grad() – turn off gradients for validation, which saves memory and computation.

To have a re-usable framework for training CNNs I encapsulate the logic within a function.

I assume that the network will be trained with a training and validation loss.

Of course, it could be further parametrised and the validation set could only be considered if the parameters are not None.

However, for the case of this notebook I believe this is enough.
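Such a function could look roughly like the sketch below. The name `train_model`, its signature, and the toy data used to exercise it are assumptions; the body follows the steps outlined above (zero the gradients, forward pass, loss, backward pass, optimizer step, then a gradient-free validation pass).

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


def train_model(model, criterion, optimizer, train_loader, valid_loader, epochs, device):
    """Train for `epochs` epochs; return per-epoch (train_losses, valid_losses)."""
    train_losses, valid_losses = [], []
    for _ in range(epochs):
        model.train()                       # dropout active
        total, seen = 0.0, 0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()           # clear gradients from the previous step
            loss = criterion(model(images), labels)  # forward pass + loss
            loss.backward()                 # backward pass
            optimizer.step()                # weight update
            total += loss.item() * images.size(0)
            seen += images.size(0)
        train_losses.append(total / seen)

        model.eval()                        # dropout switched off
        total, seen = 0.0, 0
        with torch.no_grad():               # no gradient bookkeeping for validation
            for images, labels in valid_loader:
                images, labels = images.to(device), labels.to(device)
                total += criterion(model(images), labels).item() * images.size(0)
                seen += images.size(0)
        valid_losses.append(total / seen)
    return train_losses, valid_losses


# Toy run on random data, only to show the call pattern.
torch.manual_seed(0)
toy = TensorDataset(torch.randn(32, 3, 8, 8), torch.randint(0, 2, (32,)))
loader = DataLoader(toy, batch_size=8)
net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2), nn.LogSoftmax(dim=1))
history = train_model(net, nn.NLLLoss(), optim.SGD(net.parameters(), lr=0.01),
                      loader, loader, epochs=3, device=torch.device("cpu"))
print(history)
```

Losses are averaged over the samples actually seen rather than over the dataset length, so the numbers stay correct when a sampler restricts the loader to a subset.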

Then training the model comes down to a single call of that function.

I inspect the plot presenting the evolution of training/validation loss over epochs.

Our goal is not only to reduce the training loss, but also to reduce the validation loss.

If the training loss continued to decrease while the validation loss increased, we would observe overfitting — the model would not be able to generalise well on data not seen during training.

In this case we see that the model’s losses do not decrease significantly after the 7th epoch (or even earlier, depending on preferences).

In view of this, I will load the model from the 7th epoch.

By saving all the intermediate models I am able to see what the test set performance would look like (just in case I would like to compare).
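One way to keep and restore intermediate models is to save the `state_dict` after every epoch. The sketch below assumes that approach; the file naming scheme, the temporary directory, and the stand-in linear model are hypothetical, and the training step itself is omitted.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the CNN
checkpoint_dir = tempfile.mkdtemp()

for epoch in range(1, 11):
    # ... one epoch of training would happen here ...
    path = os.path.join(checkpoint_dir, f"model_epoch_{epoch}.pt")
    torch.save(model.state_dict(), path)  # weights only, not the whole object

# Later: restore the weights from the chosen epoch (the 7th in the text above).
best = nn.Linear(4, 2)
best.load_state_dict(torch.load(os.path.join(checkpoint_dir, "model_epoch_7.pt")))
best.eval()  # evaluation mode before running on the test set
```

Loading a `state_dict` requires first building a model with the same architecture, which is why `best` is constructed before the weights are read back.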


6. Evaluating the results on a test set

In this part I evaluate the results of the network on the test set, i.e. the one that the network has not seen during training.

I write a similar script to the validation one, with the difference being the number of metrics I store for evaluation.

Accuracy of 99%, sweet! Let’s see some more detailed statistics:

99.2% Recall: this means that out of all Wario screenshots in the dataset, the model correctly identifies 99.2%.

99.3% Precision: this means that out of all the Wario predictions, 99.3% were actually Wario.

99.25% F1 Score: there is no clear-cut interpretation for this one, as the F1 Score is the harmonic mean of Precision and Recall. F1 is more useful than accuracy in case of uneven class distribution. As in this case there is an equal number of Mario and Wario images in the test set, accuracy and F1 Score turn out to be practically identical.

In general, the network does an amazing job of classifying the images: only 15 out of 2000 are classified incorrectly.
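To make the relation between these figures concrete, here is a small worked example. The confusion counts are an assumption: they are one split of the 15 errors (8 missed Wario screenshots, 7 Mario screenshots labelled as Wario) that is consistent with the reported numbers and with 1000 images per class, not values read from the notebook. `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Assumed counts, with Wario as the positive class (1000 images per class).
metrics = classification_metrics(tp=992, fp=7, fn=8, tn=993)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, recall comes out at 0.9920, precision at 0.9930, and both F1 and accuracy at 0.9925 after rounding, which matches the statistics quoted above.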

To get more insight, we will inspect some of them below.

I have to say that it is not that weird that the network had trouble with these images.

Some of them are obviously transition frames from the game (loading screen between map and level or between screens).

There was no way to infer the correct game from those.

The rest of them are maps or a specific screen from Wario (third image).

The maps from these games were pretty similar, just as the characters seen from the isometric view.

I have to say that I am very satisfied with this network’s performance and with PyTorch in general.

It offers a lot of possibilities and is quite pythonic.

For more information about basics of PyTorch, I would refer you to Udacity’s free `Intro to Deep Learning with PyTorch` MOOC, which you can find here.

If you have any feedback regarding the article, please let me know in the comments.

As always, the entire notebook can be found on my GitHub repo.
