Is a Picture Worth A Thousand Words?

Mouhamed Ndoye
Dec 16, 2018

Source: Dark Reading

Background

Our project was inspired by Jamie Ryan Kiros, who created a model trained on 14 million romance passages to generate a short romantic story for a single image input.
Similarly, the ultimate goal of our project was to output a short story for children.
“neural-storyteller is a recurrent neural network that generates little stories about images” (Jamie Ryan Kiros)

Reference: https://github.com/ryankiros/neural-storyteller

Project Goals

- Update code from Python 2 to Python 3+
- Recreate the original project, which was trained on romantic novels
- Scrape children’s stories
- Train skip-thought decoder through RNN on children’s stories in order to create style bias data
- Gather image labels and captions from Microsoft Azure Vision API
- Output a short story for children based on the input of a single image
- Compare output for romance and children’s story genres

Data Collection

The initial plan was to scrape children’s books (in PDF form) from SmashWords (as recommended by Jamie), but we faced challenges getting the Python libraries textract and PyPDF2 to extract all words from a given PDF.
Instead, we focused on scraping websites whose stories were readily available as HTML text.
Using the Web Scraper extension via Google Chrome, we were able to scrape children’s stories from several sources.
Here’s an example of a selector graph we used to navigate throughout a site’s web pages:

Example: Web Scraper selector graph

Once we aggregated the data into a single file, we ran it through nltk to parse the data into sentences.
This approach increased our dataset size to almost 45,000 sentences.
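The sentence-splitting step could look roughly like the sketch below. The post only says that nltk was used, so the choice of `sent_tokenize`, the helper name `split_sentences`, and the sample text are assumptions made for illustration. The regex fallback is also an addition, included only so the sketch still runs when nltk or its tokenizer data is not installed.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split raw story text into sentences, preferring nltk when usable."""
    try:
        # Assumption: nltk's pretrained sentence tokenizer was the tool used.
        from nltk.tokenize import sent_tokenize
        sentences = sent_tokenize(text)
    except (ImportError, LookupError):
        # nltk or its tokenizer data is missing: naive split after . ! or ?
        sentences = re.split(r"(?<=[.!?])\s+", text)
    # Drop empty fragments and surrounding whitespace.
    return [s.strip() for s in sentences if s.strip()]


# Hypothetical stand-in for the aggregated file of scraped stories.
stories = "Once upon a time there was a fox. The fox was hungry! Where could it find food?"
sentences = split_sentences(stories)
print(len(sentences), "sentences")
```

Each resulting sentence then becomes one training example, which is how splitting passages into sentences enlarges a small dataset.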
The following websites were scraped as part of our data gathering process:

- Storyberries: https://www. com/category/10-min-stories/
- American Literature: https://americanliterature.com/short-stories-for-children
- Student UK: https://www. com/category/bedtime-stories/
- Tonight’s Bedtime Story: http://www. com/stories/

Resources

We were able to set up Jupyter Notebook on a Google Cloud instance by following Amulya Aankul’s tutorial.
NOTE: We increased our disk image size to 100 GB after experiencing memory leaks with the 10 GB disk.
Model Setup

To make the new model output a short story for children, we trained the Recurrent Neural Network (RNN) decoder on children’s stories.
Slightly differently from Jamie’s model, we mapped each sentence (rather than each passage) to a skip-thought vector; the RNN is conditioned on that vector and generates a few sentences from it.
Jamie’s model, trained on romance novels, uses passages, but given our significantly smaller dataset, we broke the passages down into sentences in order to augment it.
Source: Google Blog

During the model setup, we faced challenges while training on the Common Objects in Context (COCO) images and captions.
The image above details how the Vision Deep CNN generates captions using a language-generating RNN in the final layer.
Given our time constraints for this project, we decided to use Microsoft Azure’s Vision API which generated several image labels and a single caption.
We then concatenated the image labels into a second caption, so that we could input two captions for each image.
We have provided two examples later in this post.
In Jamie’s model, the COCO dataset provides five captions.
We hypothesize that this difference in number of captions impacted our model’s ability to develop better grammar and more appropriate styling for a children’s story.
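The label concatenation described above could be sketched as follows. This is a minimal illustration, not the project’s actual code: it assumes the Vision API response carries a `description` object with a `tags` list and a `captions` list of `text`/`confidence` entries, and the helper name `build_captions`, the sample response, and its confidence value are all hypothetical.

```python
from typing import Any, Dict, List


def build_captions(analysis: Dict[str, Any]) -> List[str]:
    """Return [joined labels, top caption] from a Vision-API-style response."""
    # Assumed response layout: {"description": {"tags": [...], "captions": [...]}}
    description = analysis.get("description", {})
    tags = description.get("tags", [])
    captions = description.get("captions", [])
    if not tags or not captions:
        raise ValueError("response has no labels or no caption")
    # The labels become a pseudo-caption: a space-separated run of words.
    label_caption = " ".join(tags)
    # Keep the caption the service is most confident about.
    best = max(captions, key=lambda cap: cap.get("confidence", 0.0))
    return [label_caption, best["text"]]


# Hypothetical response, loosely modelled on the pizza example in this post.
sample = {
    "description": {
        "tags": ["photo", "pizza", "food", "table"],
        "captions": [{"text": "a pizza sitting on top of a pan", "confidence": 0.9}],
    }
}
two_captions = build_captions(sample)
print(two_captions)
```

Note that the first caption is only a list of words with no sentence structure, unlike the human-written COCO captions.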
The style-shifting function from Jamie’s project allowed us to connect the captions from the Vision API to the short story generated by the model.
The formula was constructed using three different vectors:

- X = an image caption
- C = “caption style” vector
- B = “book style” vector

As detailed by Jamie in her project, skip-thought vectors are sensitive to the following factors:

- Length: In our case, our shorter sentences (when compared to passages) impacted the length of the short story. Generally speaking, the word count of a children’s story output was about a third of that of a generated romantic story.
- Punctuation: We noticed that punctuation (for both models) was limited to periods, commas, and question marks. This may be a side effect of how the different datasets were parsed into passages/sentences.
- Vocabulary: We hypothesized that there would be a noticeable difference in vocabulary given that the target audiences for romance and children’s stories are quite different. In the examples below, it’s evident that the vocabulary used in the romantic stories is more mature.
- Syntactic style: Similarly, the syntactic style was largely based on the passages/sentences used to train the models. In our testing, there weren’t any noticeable differences in styling between the romance and children’s stories.
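The three vectors listed above combine into the style shift that the neural-storyteller README gives as F(x) = x - c + b. The sketch below illustrates that arithmetic only. The function names, the toy 3-dimensional vectors, and the averaging of the two caption vectors into x are assumptions for illustration; real skip-thought vectors have thousands of dimensions, and plain Python lists stand in for arrays here.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def style_shift(x: Sequence[float], c: Sequence[float], b: Sequence[float]) -> Vector:
    """F(x) = x - c + b: remove the caption style, add the book style."""
    if not (len(x) == len(c) == len(b)):
        raise ValueError("x, c and b must have the same dimensionality")
    return [xi - ci + bi for xi, ci, bi in zip(x, c, b)]


# Toy stand-ins for the encoded captions of one image (hypothetical values).
caption_vectors = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
x = mean_vector(caption_vectors)  # assumed: average the two captions
c = [0.5, 0.5, 0.5]               # hypothetical "caption style" vector
b = [1.0, 0.0, -1.0]              # hypothetical "book style" vector
shifted = style_shift(x, c, b)
print(shifted)
```

The shifted vector is what the decoder is conditioned on, so swapping b between a romance vector and a children’s-story vector is what changes the genre of the output.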
Workflow

Finally, we were able to package all of the components into a single model.

1. Submit an image to Microsoft Azure Vision API
2. Microsoft Azure Vision API outputs image labels and a caption: In this step, we concatenated the image labels so that they would be presented to the model as a caption
3. Submit image labels & the caption into model: With two captions, one provided by the Vision API and the other concatenated from image labels, we submit them to the model along with the style (or story genre in this project)
4. Model outputs short story

Results

Generally speaking, the romantic stories generated by the model were much longer than the short stories for children.
Images of the Friends cast and pizza were submitted as part of our test.
See the outputs below:

Source: The Independent

Friends: Concatenated Image Labels & Caption

[‘person people group posing photo standing sitting window woman man holding suit cake food table’, ‘Matt LeBlanc, Lisa Kudrow, Courteney Cox, David Schwimmer, Jennifer Aniston posing for a photo’]

Friends: Romantic Story

Jess was thinking it would never be my fault , she smiled and let Kat touch her emotions again , before she turned her head and let him step into the shower as soon as she let him touch her , Anna quickly turned her attention back to me , knowing he could be the one time before she stepped into my penthouse suite , making Liv blush again .
How many times do I have to figure out why my mom was going to marry him ?.Don t be thrilled that she was my favorite piece of jewelry .
The photo photo photo album was perfect , but it felt perfect .
She pointed her index finger at me and pointed her index finger to my lips .
Don t change my suit .
The only other woman I ve ever met was sexy and sexy as the other two male members of the wedding ceremony .
And my mom , Nat , Mel Miller , who seemed to believe she d chosen the perfect brunette out there .

Friends: Children’s Story

But, if I had ever told her mother to tell him the truth, if she had been looking out of her mind, the young man, who had never been looking out of her father, who was obliged to take care of her father, for the sake of her father, who was obliged to take care of her father, I should never see if she had eaten up her eyes, for the sake of the king’s daughter, who was obliged to take care of her father, who had nothing left her temper,

Source: Expert Photography

Pizza: Concatenated Image Labels & Caption

[‘photo pizza food different sitting box table black couple plate various meat toppings laying pan purple’, ‘a pizza sitting on top of a pan’]

Pizza: Romantic Story

pizza a drink and I found myself cuddling a little too much , not caring how much I thought it would be my turn to put myself together like she was born to wear .
Dark hair fell over one eye , making me feel sick even worse .
I was seated at a table placed on top of the pillows , folding it neatly into a pile at my bedside table .
With a dramatic change , I had no intention of making a appearance at all .
I was not going to be a very cute young couple .
I was only nine months old now , and I hate to admit it .
At the same time I placed the pan on top of the pillows holding a pan , then pointed it at the adorable kitten lying flat on my bedside table .
We could no longer read each other , and soon I d bite my bottom lip .

Pizza: Children’s Story

For a few minutes I was obliged to sit down on the top of the wood, and gave him a pair of scissors to make a drink out of my bed, I would give you a bit of water to make a drink out of the pot, and a bit of wine, I was obliged to eat a piece of bread on a string, so that I am going to eat her lips,

Challenges

- Finding the appropriate amount of training data. In retrospect, we should have picked a story genre with a significant amount of available data
- Grooming the data (removing blank rows, encoding, etc.)
- Setting up Google Cloud instances
- Finding and scraping children’s stories from several sources
- Matching the encoding of the datasets
- Recreating the original project with newer versions
- Training the model
- Original project used COCO image captioning to generate captions for images

Takeaways

- Explore data sources before committing to a single genre. In our case, we thought we’d be able to find significant data to train the model on children’s stories, but there were only so many websites to scrape that were free.
- The more data, the better (the model). Jamie’s original model was trained on 14 million passages while our model relied on 48k sentences.
- GPU over CPU. We spent a significant amount of time setting up our model on our local machines when we could have started with the Google Cloud instance earlier in the project.

Future Work

In reflection, we’ve discussed a few possibilities to improve our model.
First, an interesting addition suggested by our professor, Dr. Alex Dimakis, was to make sure that each word set as an input (in the captions) would be included in the generated short story.
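One simple way to measure how far a generated story is from that goal is to list the caption words it never uses. The sketch below is only a check, not a way to enforce the constraint during generation. The helper names are hypothetical, and the story string is a shortened excerpt of the pizza output shown earlier in this post.

```python
import re
from typing import List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z']+", text.lower())


def missing_caption_words(caption: str, story: str) -> Set[str]:
    """Caption words that never appear in the generated story."""
    return set(tokenize(caption)) - set(tokenize(story))


caption = "a pizza sitting on top of a pan"
# Shortened excerpt of a generated story, used here as sample input.
story = "I placed the pan on top of the pillows , then pointed it at the kitten ."
missing = missing_caption_words(caption, story)
print(sorted(missing))
```

In practice one would likely ignore stop words such as "a" and compare only the content words of the caption.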
Another thing to consider is the ordering of model training.
Given that the original model was trained on romantic stories first, we hypothesize that the grammar and styling of the short story outputs would be closer to children’s stories if we trained on romance stories last.
This may also be the case because of the great disparity in size of the datasets.
As mentioned earlier, the romance novel dataset contained more than 14 million passages while our children’s story dataset had 48,000 sentences.
In tandem, it may be worth finding another source for better captions.
Our current model concatenates the image labels to serve as captions for the inputs, but we’ve found that the approach impacts the grammar and styling of the short story output.
We believe this to be the case because our concatenated caption was simply a collection of nouns.
Overall, we enjoyed learning about the original project and accomplishing our project goals.
You can find our code repository here.

References

We couldn’t have completed this project without the great people of the Internet.
Please check out the following resources for your next data science project:

https://ai.
06576

By Abhilasha Kanitkar, Anuja Srivastava, and Mouhamed Ndoye.