In the media, Reinforcement Learning has most commonly appeared in the context of games: AI bots beating human professionals at Go (DeepMind’s AlphaGo), StarCraft (DeepMind’s AlphaStar), and Dota (OpenAI’s OpenAI Five).

Essentially, Reinforcement Learning is a field of machine learning pertaining to optimal decision making: training an “agent” to make an optimal action from observations of a current state, in order to reach a defined ideal state.

In Reinforcement Learning, the optimal policy (a function that maps states to optimal actions) is learned through a “trial and error” method.

The agent probabilistically chooses an action based on observations of the current state, observes the resultant state and receives a “reward” according to a defined performance metric.

That is, the data fed for training a reinforcement learning model is a tuple of {state, action, reward, next_state}.

With that data, the agent evaluates the value of the selected action and updates its parameters so that actions with better value — “optimal actions” — have a higher probability of being chosen.
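For illustration, here is the simplest concrete instance of this update: a tabular Q-learning step on one {state, action, reward, next_state} tuple. The states, actions, and hyperparameters below are hypothetical (the physician in this post uses a different, neural-network-based algorithm).

```python
from collections import defaultdict

# One training sample is the tuple (state, action, reward, next_state).
Q = defaultdict(float)           # Q[(state, action)] -> estimated value
alpha, gamma = 0.1, 0.99         # learning rate, discount factor
actions = [0, 1]                 # hypothetical action set

def update(state, action, reward, next_state):
    # Bootstrapped target: immediate reward + discounted best future value
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Shift the estimate toward the target; higher-value actions
    # will later be chosen with higher probability
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

update("s0", 1, 1.0, "s1")       # the value of action 1 in s0 rises
```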

This is a very surface-level explanation of reinforcement learning — check out my other blog posts where I explain core concepts and algorithms in detail!

Data Collection

As an independent high school researcher, I do not have the legal and professional qualifications to obtain real medical data.

To overcome this, I created a chemotherapy treatment simulation based on a mathematical model that represents the change in a patient’s cancer progression given current physiological state and the applied chemotherapy dosage.

The model is represented by a system of ordinary differential equations, adapted from (Zhao, 2009):

dW/dt = a1·M + b1·(D − d1) + N
dM/dt = (a2·W − b2·(D − d2)) · 1{M > 0} + N

where W, M, and D represent the toxicity index, tumor size index, and a normalized dosage respectively, and a, b, d are model constants statistically determined in (Zhao, 2009).

The {M > 0} term indicates a crucial assumption that once the tumor size index is reduced to 0, the patient is deemed cured forever, and there will not be a relapse of cancer.

The N term indicates Gaussian noise, which was used to perturb the dynamics in an attempt to emulate patient-by-patient variabilities and stochastic physiological events.

With this model, I created a chemotherapy simulation environment inspired by the OpenAI Gym.

That is, the environment contains the member functions:

reset(): Creates a random initial patient instance.

reward(state, action): Evaluates the applied dosage based on the current and resultant patient physiological state, and returns a numerical “reward”.

step(state, action): Returns the subsequent patient state and reward given the current state and current action.

Using this simulation, we can generate synthetic but plausible cancer progression data and corresponding chemotherapy treatment regimens.

We then use this for training our intelligent physician.

The Reward Function

The reward function is extremely important because it shapes the optimal behavior the agent attempts to achieve.

If you take a look at the general objective function of reinforcement learning:

J(π) = E[ Σ_t γ^t · r(s_t, a_t) ]

which is to maximize the expected episodic cumulative reward, the optimization of the agent’s behavior depends heavily on the reward function (in various reinforcement learning algorithms — as used in training my physician — other terms can be added to the objective function).
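The episodic cumulative reward inside that expectation is easy to compute for a finished episode; here is a small helper (the discount factor value is hypothetical):

```python
def discounted_return(rewards, gamma=0.99):
    # Computes sum_t gamma^t * r_t by accumulating backward in time
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```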

How do we evaluate the intelligent physician’s treatment regimen? That is, how do we construct a reward function that appropriately addresses the physician’s goal of reducing tumor size while also controlling accumulated toxicity?

Here is a breakdown of how we can evaluate the intelligent physician’s decisions:

Positive reward if treatment results in a reduction of tumor size / Negative reward if treatment results in an increase of tumor size.

Positive reward if treatment results in a reduction of toxicity / Negative reward if treatment results in an increase of toxicity.

High positive reward if the patient is “cured” (i.e. the tumor size was reduced to 0).

High negative reward if the patient is “dead” (i.e. the weighted sum of tumor size and toxicity index exceeds a defined threshold).
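These four criteria can be sketched as code. The coefficients and terminal rewards below are illustrative placeholders, not the post’s actual values:

```python
def reward(state, next_state, cured=False, dead=False,
           c_tumor=1.0, c_tox=0.5, r_cure=10.0, r_death=10.0):
    # Terminal cases dominate: large bonus for a cure, large penalty for death
    if cured:
        return r_cure
    if dead:
        return -r_death
    (W, M), (W2, M2) = state, next_state
    # Reward reductions in tumor size (M) and toxicity (W); penalize increases
    return c_tumor * (M - M2) + c_tox * (W - W2)
```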

And its reward function formulation, written out piecewise from the criteria above:

r = +R_cure if the patient is cured; −R_death if the patient is dead; c1·(decrease in M) + c2·(decrease in W) otherwise.

Training the Intelligent Physician

The intelligent physician is based on the Soft Actor Critic algorithm (Haarnoja, 2018).

I am going to skip the details of the algorithm (the post would get too long), but if you are interested, you can take a look at Tuomas Haarnoja’s paper, or Vaishak V. Kumar’s blog post.

In essence, the Soft Actor Critic can:

Learn an optimal policy in continuous action spaces. This is suitable for the chemotherapy task, as we want to produce the exact optimal dosage to apply to the patient. In contrast, learning in a discretized action space (which, in this context, would mean selecting from a small, finite number of predefined doses) can never yield the truly optimal decision, only the best of the given choices.

Learn the optimal policy efficiently and effectively by maximizing state space exploration. Soft Actor Critic does this by incorporating an entropy term into its objective function, inhibiting greedy action selection and encouraging the agent to act as randomly as possible while learning.
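Concretely, the maximum-entropy objective from (Haarnoja, 2018) is:

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
\left[\, r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\right]
```

where H is the entropy of the policy at state s_t, and the temperature α weights exploration against reward.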

So, here is an overview of training the intelligent physician:

1. Generate a random patient instance.
2. Carry out a chemotherapy treatment with our current model.
3. Store the cancer progression and treatment regimen in the intelligent physician’s memory.
4. Update the neural network parameters with random batches of cancer progression and treatment regimen data from the intelligent physician’s memory.
5. Repeat steps 1–4 for 1000 generated patients.
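The steps above can be sketched as a loop. The environment dynamics, policy, and SAC gradient update below are trivial stand-in stubs; only the loop structure mirrors the post.

```python
import random
from collections import deque

memory = deque(maxlen=100_000)            # the physician's replay memory

def reset():                              # stub: random patient (W, M)
    return (random.uniform(0.0, 2.0), random.uniform(0.1, 2.0))

def policy(state):                        # stub: random dose in [0, 1]
    return random.random()

def step(state, action):                  # stub dynamics + reward
    W, M = state
    next_state = (max(W + action - 0.5, 0.0), max(M - action, 0.0))
    r = (M - next_state[1]) + (W - next_state[0])
    return next_state, r

def sac_update(batch):                    # stub for the SAC gradient step
    pass

def train(n_patients=1000, treatment_steps=30, batch_size=64):
    for _ in range(n_patients):                           # repeat (step 5)
        state = reset()                                   # step 1
        for _ in range(treatment_steps):                  # step 2: treat
            action = policy(state)
            next_state, r = step(state, action)
            memory.append((state, action, r, next_state))  # step 3
            state = next_state
        if len(memory) >= batch_size:                     # step 4: update
            sac_update(random.sample(memory, batch_size))

train(n_patients=10)
```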

Testing the Trained Physician

After the intelligent physician was trained on 1000 patients, it was tested on another set of 1000 hypothetical patients.

For those 1000 patients, the physician showed the following statistics: in general, the physician was able to completely reduce the patient’s tumor mass by the end of the treatment, while some toxicity remained, but not at a lethal level.

When “cured” was defined as a final tumor mass < 0.01 and a final toxicity index < 0.8, the physician cured 947 out of 1000 simulated patients.

Here’s an example of a successful treatment progression extracted from the physician: in this case, the physician seems to have used strong doses during the initial stages to quickly reduce tumor size, then weak to zero doses to taper off the accumulated toxicity.

Discussion

Training the intelligent physician to produce optimal chemotherapy treatment regimens was a success. The physician can effectively reduce tumor size while keeping toxicity at a non-lethal level, and eventually at a completely safe level.

A 94.7% success rate is also higher than the current chemotherapy treatment success rate — according to 2018 statistics from the American Cancer Society, only 67% of people survive 5 years post-treatment, due to reasons such as excessive accumulation of toxicity or relapse of cancer.

The more important question, however, lies with the ethics of artificial intelligence.

Decision making in medical domains has extremely high stakes.

In the context of cancer treatment, patients don’t have the luxury of testing out numerous treatment regimens.

I must therefore acknowledge the problems in the method I have taken, and the inherent weaknesses of machine learning methods:

The model can only be as good as the data it is fed. Mathematical models have difficulty replicating the patient-to-patient variabilities and stochastic events that dictate real-world dynamics. Thus, my approach will never be robust enough for real-world application.

Reinforcement learning becomes problematic in real-world applications because we cannot predict or explain its behavior. All we know is that the behavior is shaped by the defined reward function and obtained by training on millions of data points. Especially in complex, high-dimensional state and action spaces, reinforcement learning tends to generalize poorly when facing unknown conditions. Thus, we can never guarantee non-catastrophic behavior.

However, modern Reinforcement Learning is a relatively young field, and is growing at a very fast pace — we can never gauge what it will be capable of in a few decades.

Regardless, real-world applications of artificial intelligence, especially when human lives are at stake, should be handled with immense responsibility and caution.

References (in this blog post)

Yufan Zhao, Reinforcement Learning Design for Cancer Clinical Trials, University of North Carolina at Chapel Hill, 2009.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine, Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholmsmässan, Stockholm, Sweden, July 10–15, 2018, pp. 1856–1865.

Feel free to reach out to me if you would like to know more about this work!