Well, GPT-2 is based on the Transformer, which is an attention model: it learns to focus attention on the previous words that are most relevant to the task at hand, predicting the next word in the sentence.
Let’s see where GPT-2 focuses attention for “The dog on the ship ran”:

The lines, read left-to-right, show where the model pays attention when guessing the next word in the sentence (color intensity represents the attention strength).
So, when guessing the next word after ran, the model pays close attention to dog in this case.
This makes sense, because knowing who or what is doing the running is crucial to guessing what comes next.
In linguistics terminology, the model is focusing on the head of the noun phrase the dog on the ship.
GPT-2 captures many other linguistic properties as well, because the attention pattern above is just one of 144 in the model.
GPT-2 has 12 layers of transformers, each with 12 independent attention mechanisms, called “heads”; the result is 12 x 12 = 144 distinct attention patterns.
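To make the mechanism concrete, here is a minimal, dependency-free sketch of the scaled dot-product attention with a causal mask that each head computes. The function name and toy vectors are illustrative, not taken from GPT-2’s actual implementation:

```python
import math

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask, as in GPT-2.

    q, k, v: lists of equal-length vectors, one per token. Each position
    attends only to itself and earlier positions, so prediction uses
    left context only. Returns (outputs, attention_weights).
    """
    d_k = len(k[0])
    outputs, weights = [], []
    for i, qi in enumerate(q):
        # Similarity of query i with keys 0..i (future keys are masked out).
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]          # stable softmax
        total = sum(exps)
        w = [e / total for e in exps] + [0.0] * (len(q) - i - 1)
        weights.append(w)
        outputs.append([sum(w[j] * v[j][d] for j in range(len(v)))
                        for d in range(len(v[0]))])
    return outputs, weights

# Toy 3-token example with 2-dim vectors (self-attention: q = k = v).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = causal_attention(x, x, x)
print(w[0])   # [1.0, 0.0, 0.0] -- the first token can only attend to itself
```

Each row of `w` is one of the attention patterns being visualized here; in the real model, every one of the 144 heads learns its own query and key projections and so produces a different pattern.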
Here we visualize all 144 of them, highlighting the one we just looked at:

Visualization of the attention patterns across the model’s 12 layers (rows) and 12 heads (columns), with Layer 4 / Head 3 selected (zero-indexed).
We can see that these patterns take many different forms.
Here’s another interesting one:

This layer/head focuses all attention on the previous word in the sentence.
This makes sense, because adjacent words are often the most relevant for predicting the next word.
Traditional n-gram language models are based on this same intuition.
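To show that intuition at its simplest, here is a tiny bigram model that predicts the next word purely from the immediately preceding one. The corpus and function names are hypothetical, purely for illustration:

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent successor of `word` in the training text."""
    return counts[word].most_common(1)[0][0]

text = "the dog ran and the dog barked and the cat ran".split()
model = train_bigram(text)
print(predict_next(model, "the"))   # 'dog' -- seen twice after 'the', vs. 'cat' once
```

A bigram model can only ever look one word back; an attention head like the one above learns the same behavior, but it sits alongside 143 other heads that can look anywhere in the left context.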
But why do so many attention patterns look like this?

In this pattern, virtually all attention is focused on the first word in the sentence, and other words are ignored.
This appears to be the null pattern, indicating that the attention head hasn’t found whatever linguistic phenomenon it is looking for.
The model seems to have repurposed the first word as the place to look when it has nothing better to focus on.
The Cat in the _____

Well, if we’re going to let NLP taint our memories of Sesame Street, then I guess Dr. Seuss is fair game as well. Let’s see how GPT-2 finishes these lines from the timeless classic, The Cat in the Hat:

On the string of one kite we saw Mother’s new gown! Her gown with the dots that are pink, white and…

Here’s how GPT-2 completed the last sentence:

Her gown with the dots that are pink, white and blue.
Not too bad! The original text has red, so at least we know it’s not just memorizing.
So how did GPT-2 know to choose a color? Perhaps due to the following attention pattern, which seems to recognize comma-separated lists:

To decide the word after and, the model focuses attention on the preceding items in the list — pink and white.
It knew to pick a word that matched the type of the previous items, in this case a color.
What’s in a name?

GPT-2 seems particularly adept at writing short biographies based solely on a name.
As an interesting exercise, try having GPT-2 generate text from the prompt “Who is <your name>?” This particular prompt generally triggers the model to write a short biography, likely because it is a common preface for author bios in articles on the Web.
Here are a couple of bios generated for the prompt “Who is Jesse Vig?”:

“Jesse Vig is a social media marketing expert and former social media marketing manager.
He is the co-founder and CEO of VigMedia.com and recently the founder and CEO of VigMedia.”

Not bad! A little bit repetitive, but it does a nice job of personalizing the story in an unexpected way.
Here’s another:

“Jesse Vig is the son of an evangelical preacher named James Vig.
He moved to the United States in 1964 and became a preacher at the University of Michigan, where he taught for 18 years until his death in October 2007.”

In this last example, GPT-2 was smart enough to know that the father of my alter ego had the same last name.
Let’s see where GPT-2 focused its attention when picking this last name:

Layer 11 / Head 10

When deciding the word to predict after James, this pattern focuses attention on previous mentions of my last name.
(Note that, internal to the model, Vig has been broken into the word pieces “V” and “ig” because it is an uncommon word.) It seems that this attention pattern specializes in identifying relationships between familial names.
To test this, let’s change the text slightly:

“Jesse Vig is the colleague of an evangelical preacher named James…”

Layer 11 / Head 10

Now that James is just a colleague, this attention pattern ignores my last name almost entirely.
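As a rough illustration of how a rare name like Vig falls apart into word pieces, here is a toy greedy longest-match tokenizer. GPT-2 actually uses byte-pair encoding, which is more involved than this; the vocabulary below is hypothetical:

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match split of a word into subword pieces.

    A simplified stand-in for GPT-2's byte-pair encoding: rare words
    fall apart into smaller pieces that ARE in the vocabulary.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])   # fall back to a single character
            i += 1
    return pieces

# Hypothetical vocabulary: common words are whole tokens, "Vig" is not.
vocab = {"the", "dog", "preacher", "V", "ig"}
print(subword_tokenize("Vig", vocab))       # ['V', 'ig']
print(subword_tokenize("preacher", vocab))  # ['preacher']
```

The practical upshot is that attention patterns like the one above operate over these pieces, not over whole words, which is why the visualization shows “V” and “ig” as separate positions.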
Note: GPT-2 seems to generate biographies based on the perceived ethnicity and sex associated with a name.
Further study is needed to see what biases the model may encode; you can read more about this topic here.
The future is generative

In just the last year, the ability to generate content of all kinds — images, videos, audio and text — has improved to the point where we can no longer trust our own senses and judgment about what is real or fake.
And this is just the beginning; these technologies will continue to advance and become more integrated with one another.
Soon, when we stare into the eyes of the generated faces on thispersondoesnotexist.com, they will meet our gaze; they will talk to us about their generated lives, revealing the quirks of their generated personalities.
The most immediate danger is perhaps the mixing of the real and the generated.
We’ve seen the videos of Obama as an AI puppet and the Steve Buscemi-Jennifer Lawrence chimera.
Soon, these deepfakes will become personal.
So when your mom calls and says she needs $500 wired to the Cayman Islands, ask yourself: Is this really my mom, or is it a language-generating AI that acquired a voice skin of my mother from that Facebook video she posted 5 years ago?

But for now, let’s just enjoy the stories about talking unicorns.
Resources:

- Colab notebook for visualization tool
- GitHub repo for visualization tool
- Illustrated Transformer tutorial
- HuggingFace’s PyTorch implementation of GPT-2
- Original Tensor2Tensor visualization tool, created by Llion Jones
- Deconstructing BERT: Distilling 6 Patterns from 100 Million Parameters