Teaching Neural Networks to Talk Like Painters Paint

Jesus Rodriguez · May 28

Conversational interfaces and natural language understanding (NLU) are among the areas of the deep learning ecosystem that have experienced the most advancements in practical implementations.
However, most NLU interfaces remain incredibly limited in terms of diversity and creativity of responses.
Ask your favorite digital assistant the same question several times and you are likely to receive the same answer, structured in the same repetitive way.
Enhancing the diversity and richness of conversational models is one of the challenges of the next generation of NLU applications.
Recently, researchers from the Microsoft Artificial Intelligence (AI) Lab in Redmond published a paper proposing a method for enhancing the richness and diversity of responses in conversational interfaces.
When we engage in a conversation, the same question can be answered in infinite ways using a combination of contextual, syntactic and semantic structures.
We are able to do that because our brains combine the richness of language with a canvas of possible answers given a specific context.
Think about it like a painter’s palette.
Painters don’t go to work using a small group of structured colors.
Instead, they mix paints of different colors that allow them to create a palette that offers a glimpse of the possibilities in front of them.
Using the same section of the palette in two consecutive attempts is likely to result in slightly different colors, with enough variation to make the painting interesting.
Imagine if we could create the equivalent of a palette for a language conversation.
Instead of generating bland and repetitive responses, an AI algorithm would be able to create rich and original responses based on context, semantics and syntax.
Entering SpaceFusion

SpaceFusion is the name of Microsoft’s proposed technique to enrich the diversity of responses in a conversational interface.
You can think of SpaceFusion as creating a palette in which, instead of colors, we have sections of human dialogs.
The palette of NLU models is based on the latent space, which models the distribution of all relevant features in a given dataset.
Conceptually, SpaceFusion is a learning paradigm proposed to align and structure the unstructured latent spaces learned by different models trained over different datasets.
To illustrate the concept of a palette of latent spaces, let’s think about an NLU scenario that has been tackled using two different approaches: sequence-to-sequence (S2S) and variational autoencoder (AE) models.
Each type of model produces a different latent space, and the two are completely disjoint.
A technique like SpaceFusion fuses the features generated by both the S2S and the AE models into a shared, homogeneous distribution.
The well-distributed latent space makes a wonderful picture, as illustrated in the previous image, but it’s really hard to achieve in practice.
Most NLU models suffer from a tradeoff between diversity and relevance, which causes subsequent responses to the same question to slowly lose relevance.
Ideally, a well-distributed latent space should provide two key properties:

1) A disentangled space structure between relevance and diversity.

2) A homogeneous space distribution in which semantics change smoothly, without holes.
Using a more mathematical nomenclature for the two previous goals, we can model an NLU dataset as D = {(x0, y0), (x1, y1), …, (xn, yn)}, where xi and yi are a context and its response, respectively.
A well-distributed latent space will be able to train a model on D to generate relevant and diverse responses given a context.
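As a concrete illustration of this structure, D is simply a list of context–response pairs; the sentences below are invented for the example:

```python
# Toy stand-in for an NLU dataset D = {(x_i, y_i)}: each context x_i is
# paired with one observed response y_i. All sentences are invented.
D = [
    ("Anyone want to start this game?", "I'd love to play it"),
    ("Anyone want to start this game?", "I'm not interested in the game"),
    ("How was your weekend?", "Pretty quiet, mostly reading"),
]

# The same context can appear with many different responses, which is
# exactly the diversity a well-distributed latent space has to capture.
contexts = [x for x, _ in D]
responses = [y for _, y in D]
```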
To accomplish the two aforementioned goals, SpaceFusion leverages a clever yet simple architecture that combines both S2S and AE models.
In the SpaceFusion architecture, an S2S model learns a context-to-response mapping using conversational data. Complementarily, an AE model utilizes speaker-specific, non-conversational data. The decoders of the S2S and AE models are shared, and the two tasks are trained alternately.
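A minimal structural sketch of that arrangement, with toy deterministic "encoders" standing in for the real neural networks (none of this is the paper's actual Keras code, and the dimension and seeding scheme are arbitrary choices for the sketch):

```python
import random

DIM = 4  # size of the shared latent space (arbitrary for this sketch)

def s2s_encode(context):
    # Toy stand-in for the S2S encoder: maps a context to a latent point.
    rng = random.Random(sum(map(ord, "s2s:" + context)))
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def ae_encode(response):
    # Toy stand-in for the AE encoder: maps a response to a latent point.
    rng = random.Random(sum(map(ord, "ae:" + response)))
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def shared_decode(z):
    # A single decoder consumes points from either encoder, which is what
    # forces the two latent spaces to align during alternating training.
    return "response decoded from z=" + str([round(v, 2) for v in z])

z_from_context = s2s_encode("Anyone want to start this game?")
z_from_response = ae_encode("I'd love to play it")
print(shared_decode(z_from_context))
print(shared_decode(z_from_response))
```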
Conceptually, the idea behind SpaceFusion is fairly intuitive.
For each pair of points from the two different latent spaces, SpaceFusion first minimizes their distance in the shared latent space and then encourages a smooth transition between them.
This is done by adding two novel regularization terms — distance term and smoothness term — to the objective function.
The distance term measures the Euclidean distance between a point from the S2S latent space, which is mapped from the context and represents the predicted response, and the points from the AE latent space.
The smoothness term measures the likelihood of generating the target response from a random interpolation between the point mapped from the context and the one mapped from the response.
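The two terms can be sketched as follows. The distance and interpolation pieces are straightforward, while log_likelihood is a placeholder for the shared decoder's scoring function; this is my simplified reading of the description above, not the paper's actual code:

```python
import math
import random

def euclidean(a, b):
    # Euclidean distance between two latent points.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def distance_term(z_context, z_response):
    # Pulls the S2S point (mapped from the context) toward the AE point
    # (mapped from the target response).
    return euclidean(z_context, z_response)

def interpolate(z_a, z_b, u):
    # Point at fraction u along the segment between z_a and z_b.
    return [ai + u * (bi - ai) for ai, bi in zip(z_a, z_b)]

def smoothness_term(z_context, z_response, log_likelihood, samples=8, seed=0):
    # Average negative log-likelihood of the target response when decoding
    # from random interpolations between the two points; minimizing it
    # encourages a smooth transition along the segment.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        u = rng.random()
        total -= log_likelihood(interpolate(z_context, z_response, u))
    return total / samples
```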
The combination of the S2S and AE models produces a solid distribution of the latent space, in which the distance and direction from a predicted response vector roughly match relevance and diversity, respectively.
Let’s illustrate this with a simple NLU scenario in which two friends are discussing the possibility of playing a specific game.
Given a context — in this case, “Anyone want to start this game?” — the positive responses “I’d love to play it” and “Yes, I do” are arranged along the same direction.
The negative ones — “I’m not interested in the game” and “No, I don’t” — are mapped on a line in another direction.
The geometrical distribution of the previous image is helpful to understand the advantages of SpaceFusion.
In that model, diversity in responses is achieved by exploring the latent space along different directions.
Furthermore, the distance in the latent space corresponds to relevance.
Responses farther away from the context — “Yes, I do” and “No, I don’t” — are usually generic, while those closer are more relevant to the specific context: “I’m not interested in the game” and “When will you?”.
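Sticking with the game example, a toy 2-D layout (coordinates invented for illustration) shows how direction separates response types while distance tracks how generic a response is:

```python
import math

# Invented 2-D latent coordinates; the context sits at the origin.
responses = {
    "I'd love to play it":            (3.0, 4.0),   # positive, specific
    "Yes, I do":                      (6.0, 8.0),   # positive, generic
    "I'm not interested in the game": (-3.0, 4.0),  # negative, specific
    "No, I don't":                    (-6.0, 8.0),  # negative, generic
}

def direction(v):
    # Unit vector: which "answer type" line the response lies on.
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm)

def distance(v):
    # Distance from the context: larger roughly means more generic.
    return math.hypot(v[0], v[1])
```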
SpaceFusion’s idea of achieving a better distribution of latent spaces makes a lot of sense conceptually, but it’s harder to achieve in practice.
In general, there are three key balances that SpaceFusion looks for in an effective distribution of latent spaces:

1) Direction vs. Diversity: One of the key advantages of SpaceFusion is distributing diverse responses homogeneously across the latent space. In most NLU techniques, responses are packed into a narrow band, which means one direction of the latent space gives us only one type of answer.

2) Distance vs. Relevance: In most NLU models, the relevance of responses decreases linearly with their order. A well-distributed latent space needs to maintain the relevance of its answers for a decent number of iterations.

3) Homogeneity and Convexity: SpaceFusion looks to achieve both homogeneity and convexity of the latent space distribution.
If the space is not homogeneous, a model will have to sample differently depending on the regional traits.
If the space is not convex, a model will have to worry about running into the holes that are not properly associated with valid semantic meanings.
Microsoft evaluated SpaceFusion across different conversational datasets and the results were self-explanatory.
Look at the following two examples, which benchmark SpaceFusion against other NLU models using the famous Reddit and Switchboard datasets.
Notice how the relevance of the responses in the other models rapidly declines, while SpaceFusion’s answers remain largely in context.
Techniques such as SpaceFusion are required to continue expanding NLU applications into more sophisticated scenarios.
Initial tests showed a great balance between the relevance and diversity of responses, which is a result of a robust distribution of the latent space.
Together with the research paper, Microsoft open sourced a Keras-based implementation of the SpaceFusion model on GitHub.