Code is available as a notebook here and online on Google Colab here.

Setup

As TensorFlow 2.0 is still in an experimental stage, I recommend installing it in a separate (virtual) environment. I prefer Anaconda, so I’ll illustrate with it:

    > conda create -n tf2 python=3.6
    > source activate tf2
    > pip install tf-nightly-2.0-preview

Let’s quickly verify that everything works as expected:

    >>> import tensorflow as tf
    >>> print(tf.__version__)
    1.13.0-dev20190117
    >>> print(tf.executing_eagerly())
    True

Don’t worry about the 1.13.x version; it just means that it’s an early preview. What’s important to note here is that we’re in eager mode by default!

    >>> print(tf.reduce_sum([1, 2, 3, 4, 5]))
    tf.Tensor(15, shape=(), dtype=int32)

If you’re not yet familiar with eager mode, then in essence it means that computation is executed at runtime, rather than through a pre-compiled graph.

You can find a good overview in the TensorFlow documentation.
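To make the distinction concrete, here is a toy sketch in plain Python. It is only an analogy and does not reflect how TensorFlow is implemented; the names `eager_sum` and `DeferredSum` are made up for illustration. The eager version computes its result the moment it is called, while the deferred version only records what to compute and does the work when `run()` is invoked.

```python
# Toy analogy only: this is NOT how TensorFlow is implemented.

def eager_sum(values):
    # Eager style: the result is computed as soon as the function is called.
    return sum(values)


class DeferredSum:
    """Graph style: record the operation now, compute it only on run()."""

    def __init__(self, values):
        self.values = list(values)
        self.evaluated = False

    def run(self):
        self.evaluated = True
        return sum(self.values)


eager_result = eager_sum([1, 2, 3, 4, 5])   # value is available immediately
node = DeferredSum([1, 2, 3, 4, 5])         # nothing has been computed yet
was_evaluated_before_run = node.evaluated
deferred_result = node.run()                # the computation happens here
print(eager_result, deferred_result, was_evaluated_before_run)
```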

Deep Reinforcement Learning

Generally speaking, reinforcement learning is a high-level framework for solving sequential decision-making problems.

An RL agent navigates an environment by taking actions based on some observations, receiving rewards as a result.

Most RL algorithms work by maximizing the sum of rewards an agent collects in a trajectory, e.g. during one in-game round.

The output of an RL-based algorithm is typically a policy: a function that maps states to actions.

A valid policy can be as simple as a hard-coded no-op action.

A stochastic policy is represented as a conditional probability distribution over actions, given some state.
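As a rough sketch of these two kinds of policy, the snippet below defines a hard-coded no-op policy and a stochastic one that samples from a softmax distribution. The preference scores inside `stochastic_policy` are hypothetical placeholders; a real agent would compute them from the state with a learned model.

```python
import math
import random


def noop_policy(state):
    # A valid (if useless) deterministic policy: always return action 0.
    return 0


def softmax(logits):
    # Turn arbitrary scores into a probability distribution.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_policy(state, rng):
    # Hypothetical scores; a learned model would produce these from `state`.
    logits = [1.0, 2.0] if state >= 0 else [2.0, 1.0]
    probs = softmax(logits)
    # Sample an action according to the conditional distribution p(action | state).
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs


rng = random.Random(0)
action, probs = stochastic_policy(0.5, rng)
print(noop_policy(0.5), action, [round(p, 3) for p in probs])
```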

Actor-Critic Methods

RL algorithms are often grouped based on the objective function they are optimized with.

Value-based methods, such as DQN, work by reducing the error of the expected state-action values.

Policy gradient methods directly optimize the policy itself by adjusting its parameters, typically via gradient descent.

Calculating gradients fully is usually intractable, so instead they are often estimated via Monte Carlo methods.
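The following is a minimal sketch of such a Monte Carlo estimate, under strong simplifying assumptions: a single-step problem with two actions, fixed made-up rewards, and a softmax policy whose parameters are the logits themselves. In this tiny case the exact gradient is available in closed form, so the sampled estimate can be compared against it.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, rewards):
    # For J = sum_a pi(a) * r(a) with a softmax policy:
    # dJ/dlogit_k = pi(k) * (r(k) - J)
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]


def monte_carlo_gradient(logits, rewards, num_samples, rng):
    # Score-function estimate: average r(a) * grad log pi(a) over sampled actions.
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        a = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        for k in range(len(logits)):
            # gradient of log softmax w.r.t. logit k is 1[a == k] - pi(k)
            grad[k] += rewards[a] * ((1.0 if a == k else 0.0) - probs[k])
    return [g / num_samples for g in grad]


logits = [0.0, 0.0]    # hypothetical policy parameters
rewards = [1.0, 0.0]   # hypothetical fixed reward per action
exact = exact_gradient(logits, rewards)
estimate = monte_carlo_gradient(logits, rewards, 20000, random.Random(0))
print(exact, estimate)
```

In realistic environments the sum over all trajectories cannot be enumerated, which is why only the sampled version remains practical.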

The most popular approach is a hybrid of the two: actor-critic methods, where the agent’s policy is optimized through policy gradients, while a value-based method is used as a bootstrap for the expected value estimates.

Deep Actor-Critic Methods

While much of the fundamental RL theory was developed on the tabular cases, modern RL is almost exclusively done with function approximators, such as artificial neural networks.

Specifically, an RL algorithm is considered “deep” if the policy and value functions are approximated with deep neural networks.

(Asynchronous) Advantage Actor-Critic

Over the years, a number of improvements have been added to address sample efficiency and stability of the learning process.

First, gradients are weighted with returns: discounted future rewards, which somewhat alleviates the credit assignment problem, and resolves theoretical issues with infinite timesteps.
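As a small illustration of discounted returns, the sketch below computes G_t = r_t + gamma * G_{t+1} by walking a reward sequence backwards. The helper name `discounted_returns` and the value gamma = 0.5 are chosen here only to keep the arithmetic easy to follow.

```python
def discounted_returns(rewards, gamma):
    # Work backwards so each return reuses the one that follows it:
    # G_t = r_t + gamma * G_{t+1}
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]
```

With gamma below 1, rewards far in the future contribute geometrically less, so the sum stays finite even for very long trajectories.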

Second, an advantage function is used instead of raw returns.

Advantage is formed as the difference between returns and some baseline (e.g. state-action estimate) and can be thought of as a measure of how good a given action is compared to some average.

Third, an additional entropy maximization term is used in the objective function to ensure the agent sufficiently explores various policies.

In essence, entropy measures how random a probability distribution is; it is maximized by the uniform distribution.
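A quick numeric check of that claim, as a sketch using Shannon entropy in nats over a few hand-picked four-action distributions:

```python
import math


def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
deterministic = [1.0, 0.0, 0.0, 0.0]

print(entropy(uniform), entropy(peaked), entropy(deterministic))
```

The uniform distribution reaches the maximum of log(4), while a policy that always picks the same action has zero entropy, which is why rewarding entropy pushes the agent away from committing to one action too early.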

Finally, multiple workers are used in parallel to speed up sample gathering while helping decorrelate them during training.

Incorporating all of these changes with deep neural networks, we arrive at two of the most popular modern algorithms: (asynchronous) advantage actor-critic, or A3C/A2C for short.

The difference between the two is more technical than theoretical: as the name suggests, it boils down to how the parallel workers estimate their gradients and propagate them to the model.
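A heavily simplified sketch of that difference, assuming each worker has already produced a gradient for the shared parameters. The function names are hypothetical, plain SGD stands in for the real optimizer, and the asynchronous variant is simulated sequentially; real asynchronous workers run concurrently and may compute gradients on stale copies of the parameters.

```python
def synchronous_update(params, worker_grads, learning_rate):
    # A2C-style: wait for every worker, average their gradients,
    # then apply a single update to the shared parameters.
    n = len(worker_grads)
    avg = [sum(g[i] for g in worker_grads) / n for i in range(len(params))]
    return [p - learning_rate * g for p, g in zip(params, avg)]


def asynchronous_updates(params, worker_grads, learning_rate):
    # A3C-style: each worker applies its own gradient as soon as it is ready.
    # Simplification: workers are assumed to finish one after another.
    for grad in worker_grads:
        params = [p - learning_rate * g for p, g in zip(params, grad)]
    return params


params = [1.0, 2.0]
worker_grads = [[0.5, 1.0], [1.5, 3.0]]   # made-up gradients from two workers
sync_params = synchronous_update(params, worker_grads, learning_rate=0.5)
async_params = asynchronous_updates(params, worker_grads, learning_rate=0.5)
print(sync_params, async_params)
```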

With this I will wrap up our tour of DRL methods, as the focus of the blog post is more on the TensorFlow 2.0 features.

Don’t worry if you’re still unsure about the subject, things should become clearer with code examples.

If you want to learn more, one good resource to get started is Spinning Up in Deep RL.

Advantage Actor-Critic with TensorFlow 2.0

Now that we’re more or less on the same page, let’s see what it takes to implement the basis of many modern DRL algorithms: an actor-critic agent, described in the previous section.

For simplicity, we won’t implement parallel workers, though most of the code will have support for it.

An interested reader could then use this as an exercise opportunity.

As a testbed we will use the CartPole-v0 environment.

Somewhat simplistic, it’s still a great option to get started with.

I always rely on it as a sanity check when implementing RL algorithms.

Policy & Value via Keras Model API

First, let’s create the policy and value estimate NNs under a single model class:

    import numpy as np
    import tensorflow as tf
    import tensorflow.keras.layers as kl

    class ProbabilityDistribution(tf.keras.Model):
        def call(self, logits):
            # sample a random categorical action from given logits
            return tf.squeeze(tf.random.categorical(logits, 1), axis=-1)

    class Model(tf.keras.Model):
        def __init__(self, num_actions):
            super().__init__(name='mlp_policy')
            # no tf.get_variable(), just simple Keras API
            self.hidden1 = kl.Dense(128, activation='relu')
            self.hidden2 = kl.Dense(128, activation='relu')
            self.value = kl.Dense(1, name='value')
            # logits are unnormalized log probabilities
            self.logits = kl.Dense(num_actions, name='policy_logits')
            self.dist = ProbabilityDistribution()

        def call(self, inputs):
            # inputs is a numpy array, convert to Tensor
            x = tf.convert_to_tensor(inputs, dtype=tf.float32)
            # separate hidden layers from the same input tensor
            hidden_logs = self.hidden1(x)
            hidden_vals = self.hidden2(x)
            return self.logits(hidden_logs), self.value(hidden_vals)

        def action_value(self, obs):
            # executes call() under the hood
            logits, value = self.predict(obs)
            action = self.dist.predict(logits)
            return np.squeeze(action, axis=-1), np.squeeze(value, axis=-1)

And let’s verify the model works as expected:

    import gym

    env = gym.make('CartPole-v0')
    model = Model(num_actions=env.action_space.n)

    obs = env.reset()
    # no feed_dict or tf.Session() needed at all
    action, value = model.action_value(obs[None, :])
    print(action, value) # [1] [-0.00145713]

Things to note here:

- Model layers and execution path are defined separately
- There is no “input” layer, model will accept raw numpy arrays
- Two computation paths can be defined in one model via functional API
- A model can contain helper methods such as action sampling
- In eager mode everything works from raw numpy arrays

Random Agent

Now we can move on to the fun stuff: the A2CAgent class.

First, let’s add a test method that runs through a full episode and returns the sum of rewards.

    class A2CAgent:
        def __init__(self, model):
            self.model = model

        def test(self, env, render=True):
            obs, done, ep_reward = env.reset(), False, 0
            while not done:
                action, _ = self.model.action_value(obs[None, :])
                obs, reward, done, _ = env.step(action)
                ep_reward += reward
                if render:
                    env.render()
            return ep_reward

Let’s see how much our model scores with randomly initialized weights:

    agent = A2CAgent(model)
    rewards_sum = agent.test(env)
    print("%d out of 200" % rewards_sum) # 18 out of 200

Not even close to optimal, time to get to the training part!

Loss / Objective Function

As I’ve described in the DRL overview section, an agent improves its policy through gradient descent based on some loss (objective) function.

In actor-critic we train on three objectives: improving the policy with advantage-weighted gradients, maximizing entropy, and minimizing value estimate errors.

    import tensorflow.keras.losses as kls
    import tensorflow.keras.optimizers as ko

    class A2CAgent:
        def __init__(self, model):
            # hyperparameters for loss terms
            self.params = {'value': 0.5, 'entropy': 0.0001}
            self.model = model
            self.model.compile(
                optimizer=ko.RMSprop(lr=0.0007),
                # define separate losses for policy logits and value
                loss=[self._logits_loss, self._value_loss]
            )

        def test(self, env, render=True):
            # unchanged from previous section
            ...

        def _value_loss(self, returns, value):
            # value loss as MSE between value estimates and returns
            return self.params['value']*kls.mean_squared_error(returns, value)

        def _logits_loss(self, acts_and_advs, logits):
            # a trick to input actions and advantages through same API
            actions, advantages = tf.split(acts_and_advs, 2, axis=-1)
            # polymorphic CE loss fn, supports sparse and weighted
            # from_logits argument ensures normalized probabilities
            cross_entropy = kls.CategoricalCrossentropy(from_logits=True)
            # policy loss is defined by policy gradients, weighted by advantages
            # note: we only calculate the loss on the actions we've actually taken
            # thus under the hood a sparse version of CE loss will be executed
            actions = tf.cast(actions, tf.int32)
            policy_loss = cross_entropy(actions, logits, sample_weight=advantages)
            # entropy loss can be calculated via CE over itself
            entropy_loss = cross_entropy(logits, logits)
            # here signs are flipped because optimizer minimizes
            return policy_loss - self.params['entropy']*entropy_loss

And we’re done with the objective functions! Note how compact the code is: there are almost more comment lines than code.
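For readers who want to see the three terms spelled out without Keras, here is a NumPy-only sketch on a tiny hand-made batch. It is not the article's implementation: the helper names are made up, and the exact reduction and weighting Keras applies inside its loss classes may differ. The coefficients 0.5 and 0.0001 mirror the 'value' and 'entropy' hyperparameters used above.

```python
import numpy as np


def log_softmax(logits):
    # numerically stable log-probabilities, computed per row
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def a2c_loss_terms(logits, values, actions, advantages, returns):
    log_probs = log_softmax(logits)
    probs = np.exp(log_probs)
    # policy loss: negative log-probability of the taken action, advantage-weighted
    taken = log_probs[np.arange(len(actions)), actions]
    policy_loss = -np.mean(advantages * taken)
    # entropy of the policy, averaged over the batch
    entropy = -np.mean(np.sum(probs * log_probs, axis=-1))
    # value loss: mean squared error between value estimates and returns
    value_loss = np.mean((returns - values) ** 2)
    return policy_loss, entropy, value_loss


def total_loss(policy_loss, entropy, value_loss, value_coef=0.5, entropy_coef=0.0001):
    # entropy is subtracted: the optimizer minimizes, but entropy should grow
    return policy_loss - entropy_coef * entropy + value_coef * value_loss


# made-up batch: two samples, two actions, uniform policy
logits = np.zeros((2, 2))
actions = np.array([0, 1])
advantages = np.array([1.0, 3.0])
values = np.array([0.5, 0.0])
returns = np.array([1.0, 0.0])

policy_loss, entropy, value_loss = a2c_loss_terms(logits, values, actions, advantages, returns)
print(policy_loss, entropy, value_loss, total_loss(policy_loss, entropy, value_loss))
```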

Agent Training Loop

Finally, there’s the train loop itself.

It’s relatively long, but fairly straightforward: collect samples, calculate returns and advantages, and train the model on them.

    class A2CAgent:
        def __init__(self, model):
            # hyperparameters for loss terms
            self.params = {'value': 0.5, 'entropy': 0.0001, 'gamma': 0.99}
            # unchanged from previous section
            ...

        def train(self, env, batch_sz=32, updates=1000):
            # storage helpers for a single batch of data
            actions = np.empty((batch_sz,), dtype=np.int32)
            rewards, dones, values = np.empty((3, batch_sz))
            obs_shape = env.observation_space.shape
            observations = np.empty((batch_sz,) + obs_shape)
            # collect samples, send to optimizer, repeat updates times
            ep_rews = [0.0]
            next_obs = env.reset()
            for update in range(updates):
                for step in range(batch_sz):
                    observations[step] = next_obs.copy()
                    actions[step], values[step] = self.model.action_value(next_obs[None, :])
                    next_obs, rewards[step], dones[step], _ = env.step(actions[step])
                    ep_rews[-1] += rewards[step]
                    if dones[step]:
                        ep_rews.append(0.0)
                        next_obs = env.reset()

                _, next_value = self.model.action_value(next_obs[None, :])
                returns, advs = self._returns_advantages(rewards, dones, values, next_value)
                # a trick to input actions and advantages through same API
                acts_and_advs = np.concatenate([actions[:, None], advs[:, None]], axis=-1)
                # performs a full training step on the collected batch
                # note: no need to mess around with gradients, Keras API handles it
                losses = self.model.train_on_batch(observations, [acts_and_advs, returns])
            return ep_rews

        def _returns_advantages(self, rewards, dones, values, next_value):
            # next_value is the bootstrap value estimate of a future state (the critic)
            returns = np.append(np.zeros_like(rewards), next_value, axis=-1)
            # returns are calculated as discounted sum of future rewards
            for t in reversed(range(rewards.shape[0])):
                returns[t] = rewards[t] + self.params['gamma'] * returns[t+1] * (1-dones[t])
            returns = returns[:-1]
            # advantages are returns - baseline, value estimates in our case
            advantages = returns - values
            return returns, advantages

        def test(self, env, render=True):
            # unchanged from previous section
            ...

        def _value_loss(self, returns, value):
            # unchanged from previous section
            ...

        def _logits_loss(self, acts_and_advs, logits):
            # unchanged from previous section
            ...
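To see what the returns and advantages calculation does on concrete numbers, here is a standalone sketch of the same recursion as a plain function. The batch values are made up, and gamma = 0.5 is used instead of 0.99 only so the arithmetic is easy to check by hand.

```python
import numpy as np


def returns_advantages(rewards, dones, values, next_value, gamma=0.99):
    # append the bootstrap value so that returns[t + 1] exists for the last step
    returns = np.append(np.zeros_like(rewards), next_value)
    for t in reversed(range(rewards.shape[0])):
        # (1 - dones[t]) cuts the recursion at episode boundaries
        returns[t] = rewards[t] + gamma * returns[t + 1] * (1 - dones[t])
    returns = returns[:-1]
    # advantage: how much better the observed return was than the critic expected
    advantages = returns - values
    return returns, advantages


rewards = np.array([1.0, 1.0, 1.0, 1.0])
dones = np.array([0.0, 1.0, 0.0, 0.0])    # an episode ends after the second step
values = np.array([1.0, 1.0, 2.0, 2.0])   # hypothetical critic estimates
returns, advantages = returns_advantages(rewards, dones, values, next_value=4.0, gamma=0.5)
print(returns)     # [1.5 1.  2.5 3. ]
print(advantages)  # [0.5 0.  0.5 1. ]
```

Note how the return at the second step is just its own reward: the done flag stops rewards from the next episode leaking backwards, while the last step of the batch is bootstrapped from the critic's estimate of the following state.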

Training & Results

We’re now all set to train our single-worker A2C agent on CartPole-v0! The training process shouldn’t take longer than a couple of minutes. After training is complete you should see an agent successfully achieve the target 200 out of 200 score.

    rewards_history = agent.train(env)
    print("Finished training, testing.")
    print("%d out of 200" % agent.test(env)) # 200 out of 200

In the source code I include some additional helpers that print out running episode rewards and losses, along with a basic plotter for the rewards_history.

Static Computational Graph

With all of this eager mode excitement you might wonder if static graph execution is even possible anymore. Of course it is! Moreover, it takes just one additional line to enable it!

    with tf.Graph().as_default():
        print(tf.executing_eagerly()) # False
        model = Model(num_actions=env.action_space.n)
        agent = A2CAgent(model)
        rewards_history = agent.train(env)
        print("Finished training, testing.")
        print("%d out of 200" % agent.test(env)) # 200 out of 200

There’s one caveat: during static graph execution we can’t just have Tensors lying around, which is why we needed that trick with ProbabilityDistribution during model definition. In fact, while I was looking for a way to execute in static mode, I discovered one interesting low-level detail about models built through the Keras API…

One More Thing…

Remember when I said TensorFlow runs in eager mode by default, even proving it with a code snippet? Well, I lied! Kind of.

If you use the Keras API to build and manage your models, then it will attempt to compile them as static graphs under the hood. So what you end up getting is the performance of static computational graphs with the flexibility of eager execution.

You can check the status of your model via the model.run_eagerly flag. You can also force eager mode by setting this flag to True, though most of the time you probably don’t need to: if Keras detects that there’s no way around eager mode, it will back off on its own.

To illustrate that it’s indeed running as a static graph, here’s a simple benchmark:

    # create a 100000 samples batch
    env = gym.make('CartPole-v0')
    obs = np.repeat(env.reset()[None, :], 100000, axis=0)

Eager Benchmark

    %%time
    model = Model(env.action_space.n)
    model.run_eagerly = True
    print("Eager Execution: ", tf.executing_eagerly())
    print("Eager Keras Model:", model.run_eagerly)
    _ = model(obs)

    ######## Results #######
    Eager Execution: True
    Eager Keras Model: True
    CPU times: user 639 ms, sys: 736 ms, total: 1.38 s

Static Benchmark

    %%time
    with tf.Graph().as_default():
        model = Model(env.action_space.n)
        print("Eager Execution: ", tf.executing_eagerly())
        print("Eager Keras Model:", model.run_eagerly)
        _ = model.predict(obs)

    ######## Results #######
    Eager Execution: False
    Eager Keras Model: False
    CPU times: user 793 ms, sys: 79.7 ms, total: 873 ms

Default Benchmark

    %%time
    model = Model(env.action_space.n)
    print("Eager Execution: ", tf.executing_eagerly())
    print("Eager Keras Model:", model.run_eagerly)
    _ = model.predict(obs)

    ######## Results #######
    Eager Execution: True
    Eager Keras Model: False
    CPU times: user 994 ms, sys: 23.1 ms, total: 1.02 s

As you can see, eager mode is slower than static mode (1.38 s versus 873 ms of total CPU time), and by default our model was indeed executing statically, more or less matching explicit static graph execution (1.02 s).

Conclusion

Hopefully this has been an illustrative tour of both DRL and the things to come in TensorFlow 2.0. Note that this is still just a nightly preview build, not even a release candidate. Everything is subject to change, and if there’s something about TensorFlow you especially dislike (or like 🙂), let the developers know!

A lingering question people might have is whether TensorFlow is better than PyTorch. Maybe. Maybe not. Both are great libraries, so it is hard to say one way or the other. If you’re familiar with PyTorch, you probably noticed that TensorFlow 2.0 not only caught up, but also avoided some of the PyTorch API pitfalls.

In either case, what is clear is that this competition has resulted in a net-positive outcome for both camps, and I am excited to see what will become of the frameworks in the future.

Originally published at inoryy.com on January 20, 2019.