Exploring PlaNet: Google-DeepMind’s Solution for Long-Term Planning in Reinforcement Learning Agents

Jesus Rodriguez
Feb 18

Planning has long been considered one of the cognitive abilities of the human mind that is nearly impossible to replicate with artificial intelligence (AI).
Some neuroscientists even regard planning for the future as one of the key characteristics of human consciousness.
Planning requires not only understanding a specific objective but also projecting that objective onto an environment whose characteristics are unknown in the present.
Humans are able to plan not only because we can understand a specific task in detail, but because we understand our surrounding environment well enough to project the outcome of that task into the future.
In the context of AI, reinforcement learning is the discipline that has been trying to build long-term planning capabilities into AI agents.
Recently, AI researchers from Google and DeepMind joined forces to work on the Deep Planning Network (PlaNet), a new reinforcement learning model that can learn about the world using images and use that knowledge for long-term planning.
When it comes to planning, reinforcement learning models can be divided into two main groups: model-free and model-based.
In its most basic form, a reinforcement learning model focuses on mastering specific tasks by mapping rewards to actions.
This is typically known as model-free reinforcement learning and has been the foundation behind systems such as DeepMind’s DQN, which mastered Atari games.
Model-free reinforcement learning typically requires a large number of simulated training sessions to map actions to sensory inputs, which often proves limiting for long-term planning strategies.
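The "mapping rewards to actions" idea can be made concrete with a minimal tabular Q-learning sketch — the simplest model-free setup, not DQN itself. The two-state toy environment and all names here are illustrative assumptions, not anything from the PlaNet work:

```python
# Minimal tabular Q-learning sketch: a model-free agent simply maps
# (state, action) pairs to an expected value, with no model of how the
# world itself behaves. The toy environment below is purely illustrative.

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One Q-learning step: nudge Q(s, a) toward reward + discounted best future value."""
    best_next = max(q[next_state].values())
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q

# Toy environment with two states and two actions; action 1 in state 0 pays off.
q = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 0.0}}
for _ in range(100):
    q = q_update(q, state=0, action=1, reward=1.0, next_state=1)

# After repeated updates, the agent prefers action 1 in state 0 — value was
# learned purely from observed rewards, which is why so many interactions
# are needed in practice.
```

Note how nothing in the table encodes the environment's dynamics — only experienced rewards — which is exactly the limitation for long-horizon planning described above.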
Model-based reinforcement learning is the best-known alternative to model-free architectures and has been the foundation behind major breakthroughs such as DeepMind’s AlphaGo, which plans by simulating future moves before selecting one.
Contrasting with model-free approaches, model-based reinforcement learning attempts to have agents learn how the world behaves in general and select actions based on long-term outcomes.
Not surprisingly, model-based reinforcement learning agents have proven more efficient at the kind of longer-term planning required in multi-player strategy games.
While model-based reinforcement learning has many advantages when it comes to long-term planning, implementing this type of agent remains challenging in practice.
For a model-based reinforcement learning agent to be efficient in an unknown environment, it needs to learn the rules of the environment from experience, which ties most model-based agents to a specific environment and training mechanism.
Generalizing these practices across diverse model-based architectures remains an open challenge in reinforcement learning.
Enter PlaNet

Google’s Deep Planning Network (PlaNet) is a purely model-based reinforcement learning algorithm that solves control tasks from images by efficient planning in a learned latent space.
In other words, PlaNet learns about an environment using images and uses that knowledge for long-term planning in image-based control tasks.
To plan long-term tasks efficiently from images, PlaNet introduces the notion of a latent dynamics model: a compact representation of the “latent states” behind an image, capturing properties such as the velocity and positions of objects.
Instead of predicting the next image directly from the current one, as other image-based planning models do, PlaNet predicts the next latent state and uses that information to generate future images.
The following figure illustrates the latent dynamics model in more detail.
The model takes a series of input images and uses encoders (grey trapezoids) to extract the latent states (green circles).
Decoders (blue trapezoids) project the predicted latent states back into images.
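The encoder–latent–decoder pipeline can be sketched schematically, with simple linear maps standing in for the learned networks. The dimensions and weight names below are illustrative assumptions, not PlaNet’s actual architecture:

```python
import numpy as np

# Schematic of the latent dynamics model: encoder, latent transition, decoder.
# Linear maps stand in for learned networks; all sizes are illustrative.

rng = np.random.default_rng(0)
IMG_DIM, LATENT_DIM = 64 * 64, 30  # flattened image vs. compact latent state

W_enc = rng.normal(size=(LATENT_DIM, IMG_DIM)) * 0.01    # encoder (grey trapezoid)
W_dyn = rng.normal(size=(LATENT_DIM, LATENT_DIM)) * 0.1  # latent transition model
W_dec = rng.normal(size=(IMG_DIM, LATENT_DIM)) * 0.01    # decoder (blue trapezoid)

def encode(image):
    """Compress an image into a compact latent state."""
    return W_enc @ image

def predict_next(latent):
    """Predict the next latent state without touching any pixels."""
    return W_dyn @ latent

def decode(latent):
    """Project a latent state back into image space."""
    return W_dec @ latent

image = rng.normal(size=IMG_DIM)
z = encode(image)                  # 30 numbers instead of 4096 pixels
z_next = predict_next(z)           # dynamics are predicted in this small space
predicted_image = decode(z_next)   # images are only decoded when needed
```

The key design point is visible in the shapes: prediction happens on the 30-dimensional latent vector, and the expensive projection back to 4096 pixels is a separate, optional step.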
The latent dynamics model is a clever way to learn about images but also introduces a brand new way to represent a reinforcement learning problem.
To learn using this new model structure, PlaNet introduces two new concepts:

A Recurrent State Space Model: a latent dynamics model with both deterministic and stochastic components, allowing it to predict the variety of possible futures needed for robust planning while remembering information over many time steps.

A Latent Overshooting Objective: a mechanism that generalizes the standard training objective for latent dynamics models to train multi-step predictions, by enforcing consistency between one-step and multi-step predictions in latent space.
This yields a fast and effective objective that improves long-term predictions and is compatible with any latent sequence model.
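The deterministic/stochastic split of the recurrent state-space model can be sketched as follows — a deterministic recurrence that carries memory across steps, plus a stochastic state sampled from a predicted Gaussian so that the same action can lead to different futures. The layer sizes and linear maps are illustrative placeholders, not the published architecture:

```python
import numpy as np

# Sketch of a recurrent state-space model step: a deterministic path
# (tanh recurrence) remembers information over time, while a stochastic
# path sampled from a predicted Gaussian covers multiple possible futures.
# All sizes and weight matrices are illustrative placeholders.

rng = np.random.default_rng(1)
DET, STO, ACT = 16, 8, 2  # deterministic state, stochastic state, action sizes

W_h = rng.normal(size=(DET, DET + STO + ACT)) * 0.1  # deterministic recurrence
W_mean = rng.normal(size=(STO, DET)) * 0.1           # predicted mean
W_lstd = rng.normal(size=(STO, DET)) * 0.1           # predicted log-std

def rssm_step(h, z, action):
    """One transition: update the deterministic state, then sample the stochastic one."""
    h_next = np.tanh(W_h @ np.concatenate([h, z, action]))
    mean, log_std = W_mean @ h_next, W_lstd @ h_next
    z_next = mean + np.exp(log_std) * rng.normal(size=STO)  # stochastic sample
    return h_next, z_next

h, z = np.zeros(DET), np.zeros(STO)
action = np.array([0.5, -0.5])
h, z = rssm_step(h, z, action)  # repeated calls yield different plausible futures
```

The deterministic path is what lets the model remember information over many steps; the sampled path is what lets planning account for uncertainty.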
The two aforementioned components are key to efficient long-term planning.
A typical image-based planning scenario would require thousands of encoder-decoder passes over images, which is incredibly expensive from a computational standpoint.
Planning in the compact latent state space, however, is fast, since only future rewards, and not images, need to be predicted to evaluate an action sequence.
For instance, a PlaNet agent can imagine how the position of a ball and its distance to the goal will change for certain actions, without having to visualize the scenario.
In the following image, notice how planning takes place directly on the latent states produced by the encoders (grey trapezoids), without using the expensive decoders (blue trapezoids) that were present in the previous figure.
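Decoder-free planning can be sketched with a simple "random shooting" loop — a simpler stand-in for the search PlaNet actually uses — where candidate action sequences are rolled forward through toy latent transition and reward models and scored by total predicted reward, with the decoder never invoked. All models and sizes here are illustrative assumptions:

```python
import numpy as np

# Sketch of planning entirely in latent space: roll candidate action
# sequences through a learned latent transition and reward model, score
# each by total predicted reward, and execute the best one. No decoder
# is ever called. Toy linear models; random shooting stands in for the
# search procedure used in practice.

rng = np.random.default_rng(2)
LATENT, HORIZON, CANDIDATES = 4, 12, 100

W_dyn = rng.normal(size=(LATENT, LATENT + 1)) * 0.3  # latent transition model
w_reward = rng.normal(size=LATENT)                   # reward predicted from latent state

def rollout_return(z, actions):
    """Total predicted reward of one action sequence, computed in latent space only."""
    total = 0.0
    for a in actions:
        z = np.tanh(W_dyn @ np.append(z, a))  # next latent state
        total += w_reward @ z                 # reward without rendering any image
    return total

z0 = rng.normal(size=LATENT)
candidates = rng.uniform(-1, 1, size=(CANDIDATES, HORIZON))  # random action sequences
scores = [rollout_return(z0, seq) for seq in candidates]
best_plan = candidates[int(np.argmax(scores))]  # the sequence the agent would execute
```

Because each rollout only touches small latent vectors and scalar rewards, evaluating a hundred twelve-step plans is cheap — exactly the saving the figure illustrates.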
PlaNet in Action

Google evaluated PlaNet against a series of state-of-the-art model-free agents across six specific challenges:

1) A cartpole swing-up task, with a fixed camera, so the cart can move out of sight.
The agent thus must absorb and remember information over multiple frames.
2) A finger spin task that requires predicting two separate objects, as well as the interactions between them.
3) A cheetah running task that includes contacts with the ground that are difficult to predict precisely, calling for a model that can predict multiple possible futures.
4) A cup task, which only provides a sparse reward signal once a ball is caught.
This demands accurate predictions far into the future to plan a precise sequence of actions.
5) A walker task, in which a simulated robot starts off by lying on the ground, and must first learn to stand up and then walk.
The results of the experiments were remarkable in several areas.
For starters, PlaNet used a single reinforcement learning agent to master all six tasks, instead of requiring a separate task-specific model-free agent for each.
The PlaNet agent was randomly placed into different environments without being told the task, and it successfully inferred the task from its image observations.
This shows that the PlaNet models can be applied generically across numerous reinforcement learning tasks without major changes.
The performance results were even more remarkable: PlaNet outperformed all model-free strategies by a wide margin, all while using roughly 50× fewer interactions with the environment on average.
The following video shows PlaNet in action.
Notice how the agent slowly masters the tasks until achieving very high levels of efficiency.
PlaNet represents one of the most exciting developments toward generalizing reinforcement learning techniques to long-term planning problems.
In addition to the research paper, Google and DeepMind open sourced an initial implementation of PlaNet on GitHub.