Data-Efficient Hierarchical Reinforcement Learning: HIRO

Sherwin Chen, Jun 25

Image from http://www.cns-jocham.de/research.html

Introduction

Traditional reinforcement learning algorithms have achieved encouraging success in recent years.

Because they reason at the level of atomic actions, however, they are hard to scale to complex tasks.

Hierarchical Reinforcement Learning (HRL) introduces high-level abstraction, which lets the agent plan at different scales.

In this post, we introduce an HRL algorithm proposed by Ofir Nachum et al. at Google Brain and published at NIPS 2018.

The algorithm, known as HIerarchical Reinforcement learning with Off-policy correction (HIRO), is designed for goal-directed tasks, in which the agent tries to reach some goal state.

Note that this post is the first of two consecutive posts.

In the next post, we will discuss another HRL algorithm proposed by the same team as an improvement on HIRO, named Near-Optimal Representation Learning for Hierarchical Reinforcement Learning.

For better readability of the mathematical expressions, you may want to refer to my personal blog.

Hierarchical Reinforcement Learning with Off-Policy Correction

We first introduce three essential questions about HRL:

How should one train the lower-level policy to induce semantically distinct behavior?
How should high-level policy actions be defined?
How should multiple policies be trained without incurring an inordinate amount of experience collection?

HIRO can be well explained by answering the above questions (from here on, we only focus on a two-level HRL agent):

1. In addition to state observations, we feed goals produced by the higher-level policy to the lower-level policy, so that the lower-level policy learns to exhibit different behavior for the different goals it tries to achieve. Accordingly, to guide the learning process of the lower-level policy, we define the goal-conditioned reward function as

r(sₜ, gₜ, aₜ, s_{t+1}) = -‖sₜ + gₜ - s_{t+1}‖₂

Eq. 1: low-level reward function

2. The high-level policy actions are defined to be goals, which the lower-level policy tries to achieve within a certain period of time. Goals are either sampled from the high-level policy every c steps, gₜ∼μ^{high} when t≡0 (mod c), or otherwise computed through a fixed goal transition function h. Mathematically, a goal is defined as

gₜ ∼ μ^{high}(sₜ) if t≡0 (mod c), otherwise gₜ = h(s_{t-1}, g_{t-1}, sₜ)

Eq. 2: goal function

3. To improve data efficiency, we train the high-level and low-level policies separately using an off-policy algorithm (e.g., TD3). Specifically, for a two-level HRL agent, the lower-level policy is trained with experience (sₜ, gₜ, aₜ, rₜ, s_{t+1}, g_{t+1}), where the intrinsic reward rₜ and the goal gₜ are computed by Eq. 1 and Eq. 2, respectively; the higher-level policy is trained on temporally extended experience (sₜ, \tilde gₜ, ∑ R_{t:t+c-1}, s_{t+c}), where \tilde gₜ is the relabeled goal (which we will discuss in the next section) and R is the reward provided by the environment.

Since all the additional variables (goals and intrinsic rewards) are defined by the agent itself, the experience collected from the environment can be shared by both policies.
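To make this sharing concrete, here is a minimal Python sketch; it is not taken from the paper's code. The function name split_segment and the array layout are assumptions made for illustration, and the toy numbers are made up. It shows how one c-step segment could yield c low-level tuples and a single high-level tuple.

```python
import numpy as np

def split_segment(states, goals, actions, env_rewards, intrinsic_rewards):
    """Turn one c-step segment into low-level and high-level experience.

    states, goals: arrays with c + 1 rows (s_t..s_{t+c}, g_t..g_{t+c})
    actions, env_rewards, intrinsic_rewards: c entries each
    """
    c = len(actions)
    # Low-level tuples: (s_i, g_i, a_i, r_i, s_{i+1}, g_{i+1})
    low_level = [
        (states[i], goals[i], actions[i], intrinsic_rewards[i],
         states[i + 1], goals[i + 1])
        for i in range(c)
    ]
    # High-level tuple: (s_t, g_t, sum of environment rewards, s_{t+c}).
    # The goal stored here is the original g_t; relabeling would replace it later.
    high_level = (states[0], goals[0], float(np.sum(env_rewards)), states[c])
    return low_level, high_level

# Toy segment with c = 3 (hypothetical values).
states = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [3.0, 1.0]])
goals = np.array([[3.0, 1.0], [2.0, 1.0], [1.0, 0.0], [0.0, 0.0]])
actions = np.array([[1.0], [0.5], [1.0]])
env_rewards = np.array([0.0, 1.0, 0.5])
# Intrinsic rewards: negative distance between s_i + g_i and s_{i+1}
intrinsic_rewards = -np.linalg.norm(states[:-1] + goals[:-1] - states[1:], axis=1)

low, high = split_segment(states, goals, actions, env_rewards, intrinsic_rewards)
print(len(low), high[2])
```

Both kinds of tuples are built from the same environment interaction, which is why no extra experience collection is needed for the second policy.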

Goals in HIRO

The goal produced by the high-level policy describes the desired relative change in state space.

This motivates the goal transition function used in Eq. 2,

h(sₜ, gₜ, s_{t+1}) = sₜ + gₜ - s_{t+1}

where sₜ+gₜ is the desired value of the state s_{t+c}; the transition keeps this target fixed, since s_{t+1} + g_{t+1} = sₜ + gₜ.

It also gives a nice interpretation of the goal-conditioned reward function defined in Eq. 1: the reward simply penalizes the Euclidean distance between the desired state sₜ+gₜ and the next state s_{t+1} that the agent reaches by taking action aₜ.

Note that in this definition, goals have the same form as the state observations.
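The two definitions above can be checked numerically with a small Python sketch. The function names and the 2-D toy values are assumptions chosen for illustration, not code from the paper.

```python
import numpy as np

def low_level_reward(s_t, g_t, s_next):
    """Eq. 1: negative Euclidean distance between s_t + g_t and s_{t+1}."""
    return -float(np.linalg.norm(s_t + g_t - s_next))

def goal_transition(s_t, g_t, s_next):
    """Goal transition: keep the absolute target s_t + g_t fixed."""
    return s_t + g_t - s_next

# Toy 2-D example (made-up numbers).
s0 = np.array([0.0, 0.0])
g0 = np.array([3.0, 4.0])   # high-level goal: move by (+3, +4)
s1 = np.array([1.0, 1.0])   # state reached after one low-level action

r0 = low_level_reward(s0, g0, s1)   # distance still to cover, negated
g1 = goal_transition(s0, g0, s1)    # remaining relative change

print(r0, g1, s1 + g1)
```

Here the remaining goal g1 is (2, 3), and s1 + g1 equals s0 + g0, so the absolute target does not move as the agent acts.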

So far, we have made sense of goals for the lower-level policy; now let us consider their role in the high-level policy.

First, we notice that, although goals are produced by the high-level policy as high-level actions, they are in fact carried out by the lower-level policy.

As the lower-level policy evolves, the lower-level actions taken to achieve the same goal change, which in turn results in different states and rewards collected along the way.

This invalidates old high-level transitions (sₜ, gₜ, ∑ R_{t:t+c-1}, s_{t+c}).

To reinstate the experience, we relabel the goal so that it is likely to induce the same low-level behavior under the current instantiation of the lower-level policy.

Mathematically, \tilde gₜ is chosen to maximize the probability μ^{low}(a_{t:t+c-1}|s_{t:t+c-1}, \tilde g_{t:t+c-1}), where the intermediate goals \tilde g_{t+1:t+c-1} are computed using the goal transition function defined in Eq. 2.

In practice, we instead maximize the log probability, which, if actions are sampled from a Gaussian distribution, can be computed as

log μ^{low}(a_{t:t+c-1}|s_{t:t+c-1}, \tilde g_{t:t+c-1}) ∝ -1/2 ∑_{i=t}^{t+c-1} ‖aᵢ - μ^{low}(sᵢ, \tilde gᵢ)‖₂² + const

Eq. 3: measure for relabeled goals

To approximately maximize this quantity, we compute this log probability for a number of candidate goals \tilde gₜ and relabel the experience with the candidate that attains the highest value.

For example, we evaluate this quantity on eight candidate goals sampled randomly from a Gaussian distribution centered at s_{t+c}-sₜ, and also include the original goal gₜ and a goal corresponding to the difference s_{t+c}-sₜ in the candidate set, for a total of 10 candidates.

The candidate that maximizes Eq. 3 is then chosen as the relabeled goal.
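The candidate search described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the toy policy, the additive dynamics s' = s + a, the helper names, and the sampling standard deviation sigma are all hypothetical choices.

```python
import numpy as np

def goal_transition(s_t, g_t, s_next):
    """Goal transition function: s_t + g_t - s_{t+1}."""
    return s_t + g_t - s_next

def relabel_score(policy, states, actions, g_first):
    """Eq. 3 up to a constant: -1/2 * sum_i ||a_i - policy(s_i, g_i)||^2."""
    g, total = g_first, 0.0
    for i in range(len(actions)):
        total += float(np.sum((actions[i] - policy(states[i], g)) ** 2))
        g = goal_transition(states[i], g, states[i + 1])  # intermediate goal
    return -0.5 * total

def relabel_goal(policy, states, actions, g_orig, rng, n_samples=8, sigma=0.5):
    """Choose among 10 candidates: g_orig, s_{t+c} - s_t, and 8 Gaussian samples."""
    delta = states[-1] - states[0]
    sampled = rng.normal(loc=delta, scale=sigma, size=(n_samples, delta.shape[0]))
    candidates = np.vstack([g_orig, delta, sampled])
    scores = [relabel_score(policy, states, actions, g) for g in candidates]
    return candidates[int(np.argmax(scores))]

# Toy setup (assumptions): the policy steps halfway toward the goal.
def toy_policy(s, g):
    return 0.5 * g

def rollout(policy, s0, g0, c):
    """Collect c steps under the assumed dynamics s' = s + a."""
    states, actions, g = [s0], [], g0
    for _ in range(c):
        a = policy(states[-1], g)
        s_next = states[-1] + a
        g = goal_transition(states[-1], g, s_next)
        states.append(s_next)
        actions.append(a)
    return np.array(states), np.array(actions)

rng = np.random.default_rng(0)
states, actions = rollout(toy_policy, np.zeros(2), np.array([2.0, -1.0]), c=3)
stale_goal = np.array([5.0, 5.0])  # goal stored under an older low-level policy
new_goal = relabel_goal(toy_policy, states, actions, stale_goal, rng)
print(new_goal)
```

Because the original goal is always in the candidate set, the relabeled goal can never score worse under Eq. 3 than the goal it replaces.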

Algorithm

The following figure, excerpted from the paper, summarizes the algorithm, where both the high-level and low-level policies are trained by TD3.

[Figure: from Data-Efficient Hierarchical Reinforcement Learning]

Note that we actually have to store experiences (s_{t:t+c}, gₜ, a_{t:t+c-1}, ∑ R_{t:t+c-1}) to train the high-level policy, since we have to relabel goals.

END

You are welcome to leave comments and reflections below to discuss the topic :)

References

Ofir Nachum et al. Data-Efficient Hierarchical Reinforcement Learning

Scott Fujimoto et al. Addressing Function Approximation Error in Actor-Critic Methods