Automatic Image Captioning : Building an image-caption generator from scratch !

Automatic Image Captioning : Building an image-caption generator from scratch !Garima NishadBlockedUnblockFollowFollowingMar 11This article will guide you to create a neural network architecture to automatically generate captions from images.

So the main goal here is to put CNN-RNN together to create an automatic image captioning model that takes in an image as input and outputs a sequence of text that describes the image.

But why caption the images?This is because image captions are used in a variety of applications.

For example:It can be used to describe images to people who are blind or have low vision and who rely on sounds and texts to describe a scene.

In web development, it’s good practice to provide a description for any image that appears on the page so that an image can be read or heard as opposed to just seen.

This makes web content accessible.

It can be used to describe video in real time.

A captioning model relies on two main components, a CNN and an RNN.

Captioning is all about merging the two to combine their most powerful attributes i.


CNNs excel at preserving spatial information and images, andRNNs work well with any kind of sequential data, such as generating a sequence of words.

So by merging the two, you can get a model that can find patterns and images, and then use that information to help generate a description of those images.

How will it train though?The answer is COCO dataset.

COCO stands for common objects and contexts and it contains a large variety of images where each image has a set of about five associated captions.

Say you’re asked to write a caption that describes this image, how would you approach this task?Based on how these objects are placed in an image and their relationship to each other, you might think that a dog is looking at the sky.

He is smiling so he might as well be happy.

Also he is outside so he might as well be in a park.

The sky isn’t blue so it might as well be sunset or sunrise.

After collecting these visual observations, you could put together a phrase that describes the image as, “A happy dog is looking at the sky”.

CNN-RNN ModelFeed an image into a CNN.

We’ll be using a pre-trained network like VGG16 or Resnet.

CNN Model (Image Courtesy: Google)Since we want a set of features that represents the spatial content in the image, we’re going to remove the final fully connected layer that classifies the image and look at earlier layer that processes the spatial information in the image.

RNN ModelSo now CNN acts as a feature extractor that compresses the information in the original image into a smaller representation.

Since it encodes the content of the image into a smaller feature vector hence,this CNN is often called the encoder.

When we process this feature vector and use it as an initial input to the following RNN, then it would be called decoder because RNN would decode the process feature vector and turn it into natural language.

Tokenizing CaptionsThe RNN component of the captioning network is trained on the captions in the COCO dataset.

We’re aiming to train the RNN to predict the next word of a sentence based on previous words.

For this we transform the captioned associated with the image into a list of tokenize words.

This tokenization turns any strings into a list of integers.

So how does this tokenization work?First, we iterate through all of the training captions and create a dictionary that maps all unique words to a numerical index.

So, every word we come across, will have a corresponding integer value that we can find in this dictionary.

The words in this dictionaries are referred to as our vocabulary.

Tokenization of caption (Image Courtesy: Google)This list of tokens (in a caption )is then turned into a list of integers which come from our dictionary that maps each distinct word in the vocabulary to an integer value.

Assigning Integer value to TokensEmbedding Layer?It transforms each word in a caption into a vector of a desired consistent shape.

After this embedding step, we’re finally ready to train an RNN that can predict the most likely next word in a sentence.

To process words and create a vocabulary, we’ll be using the Python text processing toolkit: NLTKNow RNN has two responsibilities:Recollect spatial information from the input feature vector.

Predict the next word.

Training SetupBegin by setting the following variables:Train your ModelTo figure out how well your model is doing, you can look at how the training loss and perplexity evolve during training.

 However, this will not tell you if your model is overfitting to the training data, and, unfortunately, overfitting is a problem that is commonly encountered when training image captioning models.

Load Trained ModelsYou will load the trained encoder and decoder.

Wanna know how to do this ?Then check this link here.

Generate Predictions!get_prediction() can be used to use to loop over images in the test dataset and print model’s predicted caption.

Call the function & there you have it !.Your own image caption generator.

Want a full source code for this project ?.Refer here : Automatic Image Captioning.

. More details

Leave a Reply