A Gentle Implementation of Reinforcement Learning in Pairs Trading

For each pair of time series, the agent learns to maximize the expected trading profit [reward] by selecting the best combination of historical window, trading window, trade threshold, and stop loss [action].

In other words, we formulate it as an N-Armed Bandit problem (stateless):
- State space: [None] (fixed by a dummy state: transaction_cost)
- Action space: [historical window], [trading window], [trading threshold], [stop loss], [confidence level]
- Reward: [mean return]

Part 5: Putting Everything Together

Now we are good to go.

Here is the implementation:
1. Load relevant configs and price data
2. Standardize and separate them into training and testing sets
3. Create the state space and action space
4. Create and build the network
5. Create the learning object and perform the training
6. Extract the record from the learning object and perform testing analysis

Settings
- Pair: JNJ-PG
- Data period: 2018-01-01 to 2018-07-30
- Frequency: 1-minute
- States: None (fixed by setting it to the fixed transaction cost of 0.1%)
- Actions:
  i. Historical window: 60 to 600 minutes, 60-minute step
  ii. Trade window: 120 to 1200 minutes, 120-minute step
  iii. Trade threshold: (+/-)1 to 5, price step is 1
  iv. Stop loss: (+/-)1 to 2 on top of trade threshold, price step is 0.5
  v. Confidence level: 90% or 95%
- Profit taking level: 0
- Reward: mean return (if it is cointegrated, otherwise it is set to the transaction cost)
- Trade quantity: 1 spread per buy / sell signal
- Calibration prices: standardized
- Trading prices: actual
- Others: assume trading at closing price

After a trial run I found that the probability output for Boltzmann exploration could go up to 1. To mitigate the impact of extraordinarily high returns, the mean reward is capped at 10.
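To get a feel for the size of this action space, the grid in the settings above can be enumerated with itertools.product. This is only a minimal sketch: the dictionary keys and variable names are illustrative assumptions and are not taken from the project code.

```python
from itertools import product

# Candidate values for each action dimension, copied from the settings above.
# The key names are illustrative, not the ones used in the project.
action_grid = {
    "historical_window": list(range(60, 601, 60)),   # minutes
    "trade_window": list(range(120, 1201, 120)),     # minutes
    "trade_threshold": [1, 2, 3, 4, 5],
    "stop_loss": [1.0, 1.5, 2.0],                    # on top of the trade threshold
    "confidence_level": [0.90, 0.95],
}

# Each arm of the bandit is one full combination of the five parameters.
keys = list(action_grid)
actions = [dict(zip(keys, combo)) for combo in product(*action_grid.values())]

print(len(actions))   # number of arms
print(actions[0])     # the first combination in the grid
```

With these ranges the bandit has 10 x 10 x 5 x 3 x 2 = 3,000 arms, which is why a fixed dummy state already gives a non-trivial search problem.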


[Listing: config .yml file]

Step 1 & 2: Load relevant configs and price data, standardize and separate them into training and testing sets

Step 3: Create the state space and action space

Step 4: Create and build the network

Step 5: Create the learning object and perform the training

Step 6: Extract the records from the learning object and perform testing analysis

The training result shows that the mean reward is positive even though it is capped:

[Figure: Positive expected reward in training]

[Figure: Distribution of training reward]

The following test trades at every minute using the optimal action obtained from the training result, excluding the maximum possible historical window and trading window:

[Figure: No. of trades (pair) [LHS] and PnL [RHS] across testing samples (1-minute, 2018-5-29 to 2018-7-30)]

Alternatively, we can also use Zipline and Pyfolio for more sophisticated back-testing.

Although the result seems promising, in the real world the situation is complicated by numerous factors such as bid-ask spread, delay in execution, margin, interest, fractional shares etc.

However, our objective here is to give an example of how to combine various techniques to develop a systematic trading tool with structured machine learning components.

I hope you enjoyed this article.

Part 2: Code Design

[Figure: Illustration of the code structure]

2.1 Config

The execution is governed by the config (dictionary).

This component allows us to encapsulate a lot of executions and tidy up the code.

It can also be used as a carrier of additional parameters.

For instance, in the previous section, the instantiation of API.Tiingo takes the config as an input and sets it as an attribute. When it calls the underlying functions, input parameters such as the start date, end date, token, number of samples per day, and data frequency are extracted from the config.


[Listing: .yml config for data fetching]

Currently only a single config is implemented.

Ideally, we should implement multiple configs for different components.

Using the PyYAML package, the code can recognize the fields in a .yaml / .yml file and convert the format automatically:
- Empty field: loaded into None
- True/False: loaded into Boolean True or False
- 1.0: loaded into float 1.0
- 1: loaded into integer 1
- string: loaded into 'string'
- [1, 2, 3]: loaded into list [1, 2, 3]
- 2018-01-01: loaded into datetime.date(2018, 1, 1)

Finally, if we put this into the yaml file:

Folder:
    Folder A: Math Notes
    Folder B: [Memo, Magazines]

the package can recognize the indentation and load it into a nested dictionary:

{'Folder': {'Folder A': 'Math Notes', 'Folder B': ['Memo', 'Magazines']}}

Check UTIL/FileIO.py for the reading and writing functions.

2.2 Data API

We have already covered the main details, so I will skip this here. If you would like to add another API, I suggest simply making another class with the same interface as fetch in the class Tiingo.
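As a rough illustration of the "same interface" idea, the sketch below defines a small abstract base with a fetch method and one toy implementation. Everything here is an assumption made for illustration: the DataAPI and InMemoryAPI names, the fetch(symbols) signature, the config keys, and the price values are hypothetical, and the actual signature of Tiingo.fetch in the project may differ.

```python
from abc import ABC, abstractmethod


class DataAPI(ABC):
    """Hypothetical common interface for data sources (not part of the project)."""

    def __init__(self, config):
        # Mirror the behaviour described for Tiingo: keep the config as an attribute.
        self.config = config

    @abstractmethod
    def fetch(self, symbols):
        """Return price data for the given symbols."""


class InMemoryAPI(DataAPI):
    """Toy source that serves prices from a dictionary instead of a web service."""

    def __init__(self, config, prices):
        super().__init__(config)
        self._prices = prices

    def fetch(self, symbols):
        # ISO date strings compare correctly as plain strings.
        start, end = self.config["start"], self.config["end"]
        return {
            s: [(d, p) for d, p in self._prices[s] if start <= d <= end]
            for s in symbols
        }


# Made-up prices for a made-up ticker, purely to exercise the interface.
config = {"start": "2018-01-02", "end": "2018-01-03"}
prices = {"AAA": [("2018-01-01", 100.0), ("2018-01-02", 101.0), ("2018-01-03", 102.0)]}
api = InMemoryAPI(config, prices)
print(api.fetch(["AAA"]))
```

Because the calling code only relies on fetch, a second source can be swapped in without touching the training script.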


2.3 Strategy

In ./STRATEGY each module contains a strategy category, and each strategy should be represented by one class. The class inherits from an abstract base class which requires it to implement the following:
- process(): called by the machine learning script during training or testing
- reward: property that defines the RL reward (e.g. trade profit)
- record: any other attributes to be stored during the training

Inside the package we can find a strategy class EGCointegration which takes price data x and y and other parameters during the instantiation. When the underlying functions need a sample data set, they will call the get_sample function to perform the sampling from its data attributes.

[Listing: EGCointegration class]

During the training phase, in each iteration we need to calibrate the p-value and coefficients to decide whether and how a pair trade should be triggered. These executions are embedded in the same class. When process is called, the object automatically samples from its data attributes and runs the calibration. Based on the calibrated result, the function obtains a reward and a record and sets them to the corresponding attributes. See more about cointegration and its testing in Part 3.
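The contract between a strategy and the learning script can be sketched with Python's abc module. This is not the project's base class: the Strategy and MeanReturn names, the process signature, and the toy reward rule are assumptions chosen to keep the example self-contained, and no cointegration test is run here.

```python
from abc import ABC, abstractmethod


class Strategy(ABC):
    """Sketch of the abstract base class described above (names are assumptions)."""

    @abstractmethod
    def process(self, **action):
        """Run one training or testing iteration for the chosen action."""

    @property
    @abstractmethod
    def reward(self):
        """RL reward, e.g. trade profit."""

    @property
    @abstractmethod
    def record(self):
        """Any other attributes to be stored during training."""


class MeanReturn(Strategy):
    """Toy strategy: the reward is the mean of the last `window` returns."""

    def __init__(self, returns):
        self._returns = returns
        self._reward = None
        self._record = None

    def process(self, window):
        sample = self._returns[-window:]              # stand-in for get_sample
        self._reward = sum(sample) / len(sample)
        self._record = {"window": window, "n_obs": len(sample)}

    @property
    def reward(self):
        return self._reward

    @property
    def record(self):
        return self._record


strategy = MeanReturn([0.5, -0.25, 0.25, 0.5])
strategy.process(window=2)
print(strategy.reward, strategy.record)
```

The learning script only needs to call process with an action and then read reward and record, so any class that satisfies this contract can replace the toy one.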

[Listing: Key functions for calibration in EGCointegration]

2.4 Basic Building Blocks, Processors, and ML Algorithms

These components are highly integrated and governed not only by the config but also by the tailor-made agent, which controls the whole, highly automated ML process.

Many ML algorithms were hard-coded, which means that if the logic needs to be fine-tuned, the code has to be amended, which is a bit inconvenient.

Here, although the design is a bit complicated, if you can understand the style you will be able to expand it in any way you want.

Recently, Google has released an open-source library for reinforcement learning (RL) called TF-Agents.

Feel free to check this out.

Some concepts are similar, but the main focus of our code is on the automation so you may use that as a foundation if you would like to build a new one.



2.4.1 Basic Building Blocks

Agent

[Listing: Agent class in Basic.py]

It is the main body that runs and controls the processes in ML. In RL, it has another layer of meaning: in general it is the component that receives the states of the environment and decides what action to take accordingly. The Agent class is meant to be inherited by the machine learning class. It should be initiated with a Network object and a config dictionary. Major functions include:
- docking: attach the Network input and output layers
- assign_network: assign a new Network to the Agent object and connect
- set_session: set the TensorFlow session
- get_counter: extract the parameters from config and get a dictionary of StepCounter objects for looping or increments such as varying probability
- save_model / restore_model: save and restore the model in / from a .ckpt file
- process: abstract method to be implemented for training or testing

Network

A typical way of building a TensorFlow neural network is something like this, in which the layers and the parameters in each of them are hard-coded:

[Listing: hard-coded TensorFlow network]

Alternatively we could also build a function that repeats the above process, forfeiting the flexibility in setting the layer arguments.

If you want to build an ML system or something with a GUI, with flexibility in customizing the details of each layer (i.e. layer type, layer inputs, layer arguments) while preserving the automation, here is a suggestion:

[Figure: Network and TFLayer in Basics.py]

The two functions on the left are under the class Network.

- build_layers: takes a dictionary layer_dict as an input and constructs the network by sequentially adding layers selected from the TFLayer class, as shown on the right-hand side. As long as the parameters are properly defined for each layer, this function can be called repeatedly to add layers on top of the existing final layer in the current network. Every layer is set as an attribute of the Network object, so layer names must be unique.
- add_layer_duplicates: similar to build_layers, it takes a layer_dict as an input and additionally requires n_copy, which specifies how many copies of the layer(s) prescribed by layer_dict should be added on top of the existing network. New names are created for the duplicated layers by concatenating the layer name and the number of that layer among the copies.

For example:

[Listing: network construction example]

The steps to create a network:
1. Initiate a Network object. It has to be instantiated with the first input layer, which is the tf.placeholder in this example.
2. Build the network based on layer_dict1. It specifies 2 layers: a ‘one_hot’ layer, which is actually tf.one_hot with 5 outputs, and a ‘coint1’ layer, which is a fully_connected layer with 10 outputs. The input arguments of the fully_connected layer are defined by the key ‘layer_para’.
3. Expand the network by adding copies of the layer prescribed by layer_dict2. The layer ‘coint2’ with 10 outputs is added to the current network 3 times.

Therefore, the Network object N should now have 6 attributes in total. Each of them is a layer with predefined properties:

[Listing: attributes of the Network object N]

Since the construction of the network is based on the layer dictionary, automation comes into play if the generation of such a dictionary is streamlined, and we no longer need to hard-code the network every time we build something new.
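The dictionary-driven construction can be mimicked without TensorFlow. The sketch below is a pure-Python stand-in, not the project's Network class: layers are plain dictionaries instead of tensors, and the layer_dict schema (a "type" key next to ‘layer_para’), the TinyNetwork name, and the underscore used when naming the copies are all assumptions.

```python
class TinyNetwork:
    """Pure-Python stand-in for Network: layers are plain dicts, not tensors."""

    def __init__(self, input_layer):
        self.input_layer = input_layer
        self.layer_names = ["input_layer"]

    def _add(self, name, spec):
        # Layers are stored as attributes, so a name can only be used once.
        if hasattr(self, name):
            raise ValueError(f"layer name {name!r} is already in use")
        layer = {"type": spec["type"], "input": self.layer_names[-1]}
        layer.update(spec["layer_para"])
        setattr(self, name, layer)
        self.layer_names.append(name)

    def build_layers(self, layer_dict):
        """Stack the layers of layer_dict, in order, on top of the current last layer."""
        for name, spec in layer_dict.items():
            self._add(name, spec)

    def add_layer_duplicates(self, layer_dict, n_copy):
        """Stack n_copy copies of the layer(s) in layer_dict, numbering each copy."""
        for i in range(1, n_copy + 1):
            for name, spec in layer_dict.items():
                self._add(f"{name}_{i}", spec)


layer_dict1 = {
    "one_hot": {"type": "one_hot", "layer_para": {"depth": 5}},
    "coint1": {"type": "fully_connected", "layer_para": {"num_outputs": 10}},
}
layer_dict2 = {
    "coint2": {"type": "fully_connected", "layer_para": {"num_outputs": 10}},
}

N = TinyNetwork(input_layer={"type": "placeholder"})
N.build_layers(layer_dict1)
N.add_layer_duplicates(layer_dict2, n_copy=3)
print(N.layer_names)
```

As in the walkthrough, the object ends up with six layers: the input, the two layers from layer_dict1, and three numbered copies of ‘coint2’.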

Space

It refers to a sample space object. It takes a dictionary of lists as an input and creates the sample space by making full combinations across list elements. For example, for the following sample space:

space_dict = {'dice': [1, 2, 3, 4, 5, 6], 'coin': ['H', 'T']}
S = Space.states_dict

S contains all combinations of ‘dice’ and ‘coin’, 12 elements in total. It contains the necessary functions that convert a sample from a dictionary to a single index, a list of indices, or a one-hot array and vice versa, to suit the different kinds of input or output carriers in TensorFlow.
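The dice-and-coin example can be reproduced with a few lines of standard-library Python. This is a sketch of the idea rather than the project's Space class: the SimpleSpace name and its method names are assumptions, and the ordering of the combinations follows itertools.product.

```python
from itertools import product


class SimpleSpace:
    """Sketch of a sample space built from a dictionary of lists."""

    def __init__(self, space_dict):
        self.keys = list(space_dict)
        self.values = [list(v) for v in space_dict.values()]
        # Full combinations across the list elements, one dict per sample.
        self.samples = [dict(zip(self.keys, c)) for c in product(*self.values)]

    def __len__(self):
        return len(self.samples)

    def to_index(self, sample):
        """Dictionary sample -> single index."""
        return self.samples.index(sample)

    def to_dict(self, index):
        """Single index -> dictionary sample."""
        return self.samples[index]

    def to_indices(self, sample):
        """Dictionary sample -> list with one index per dimension."""
        return [vals.index(sample[k]) for k, vals in zip(self.keys, self.values)]

    def to_one_hot(self, index):
        """Single index -> one-hot list over all samples."""
        return [1 if i == index else 0 for i in range(len(self))]


space_dict = {"dice": [1, 2, 3, 4, 5, 6], "coin": ["H", "T"]}
S = SimpleSpace(space_dict)
sample = {"dice": 2, "coin": "T"}
print(len(S), S.to_index(sample), S.to_indices(sample))
```

The single index suits a one-hot network input, while the per-dimension indices are convenient when each parameter is handled separately.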

StepCounter

During training, some parameters are incremental, such as the current step in a for loop or a learning rate that is set to be variable. We may even want to add a buffer before the actual stepping is triggered (e.g. the learning rate starts to drop after 100 loops). Instead of hard-coding these in the script, we can have a step counter do the work. The counter also incorporates the ability to buffer pre-train steps: for example, the actual counting value starts to change only after 100 buffering steps.
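A counter with a buffering period might look roughly like the sketch below. It is not the project's StepCounter: the class name, the constructor arguments, and the clamping at the end value are assumptions made for illustration.

```python
class SimpleStepCounter:
    """Counter whose value only starts to move after `buffer` calls to step()."""

    def __init__(self, start, end, step, buffer=0):
        self.value = start
        self.end = end
        self.step_size = step
        self.buffer = buffer
        self.n_calls = 0

    def step(self):
        self.n_calls += 1
        if self.n_calls <= self.buffer:
            return self.value                       # still buffering, no change
        if self.step_size >= 0:
            self.value = min(self.value + self.step_size, self.end)
        else:
            self.value = max(self.value + self.step_size, self.end)
        return self.value


# A value that stays at 1.0 for two buffering steps, then decays to 0.5.
counter = SimpleStepCounter(start=1.0, end=0.5, step=-0.25, buffer=2)
values = [counter.step() for _ in range(5)]
print(values)
```

Keeping this logic in one object means the training loop only calls step() and reads the value, whether the parameter is a loop index, a learning rate, or an exploration probability.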



2.4.2 Processors

A Processor class should take an Agent object as an input for initiation.

When the process is called it will extract relevant parameters from the Agent object, including the attached config dictionary, and attach any output to the data dictionary which is an attribute of the Agent.

We can actually create another object to carry these attributes but for simplicity let’s not overload the structure in here.
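The processor pattern described here (read from the agent's config, write into the agent's data dictionary) can be sketched as follows. ToyAgent, Processor, CapReward, the "RewardCap" config key and the "CAPPED_REWARD" data key are all hypothetical names used for illustration; only the ENGINE_REWARD key and the cap of 10 are borrowed from elsewhere in this article.

```python
class ToyAgent:
    """Minimal stand-in for the Agent: it only carries a config and a data dictionary."""

    def __init__(self, config):
        self.config = config
        self.data = {}          # shared carrier that every processor reads and writes


class Processor:
    """Base class: a processor is initiated with the agent it works for."""

    def __init__(self, agent):
        self.agent = agent

    def process(self):
        raise NotImplementedError


class CapReward(Processor):
    """Example processor: caps the engine reward at the level set in the config."""

    def process(self):
        cap = self.agent.config["RewardCap"]
        raw = self.agent.data["ENGINE_REWARD"]
        self.agent.data["CAPPED_REWARD"] = min(raw, cap)


agent = ToyAgent(config={"RewardCap": 10})
agent.data["ENGINE_REWARD"] = 25.0
CapReward(agent).process()
print(agent.data)
```

Since every processor talks only to the agent, processors can be added or removed without changing each other.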

State Space and Action Space

Both of them inherit from the parent class Space and are used to generate state samples or action samples. Based on the method specified in the config, they can output the samples in different forms (i.e. index / one-hot / dictionary) or in different ways (with / without exploration), serving different purposes such as network training or as the input of the process function in the Strategy object.

[Listing: StateSpace and ActionSpace in PROCESSOR/MachineLearning.py]

Reward Engine

It takes an engine object which contains a process method. In our example it will be an EGCointegration object.

[Listing: RewardEngine class]

Exploration

This article gives a very good introduction to the exploration methods in reinforcement learning.

The purpose of this object is to explore possible actions.

The selected method will return an action index to the data carrier in the Agent object.

The exploration is implemented when the process function in the ActionSpace is called.
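For reference, the textbook form of Boltzmann (softmax) exploration is sketched below using only the standard library. It is not the project's Exploration class, which works on the network output; the function names and the temperature parameter are assumptions. The second call shows how one very large value pushes its selection probability towards 1, which is the effect that motivates capping the reward.

```python
import math
import random


def boltzmann_probabilities(values, temperature=1.0):
    """Softmax over action values; subtracting the max keeps exp() from overflowing."""
    top = max(values)
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_explore(values, temperature=1.0, rng=random):
    """Sample an action index with probability proportional to exp(value / temperature)."""
    probs = boltzmann_probabilities(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]


moderate = boltzmann_probabilities([1.0, 2.0, 3.0])
extreme = boltzmann_probabilities([1.0, 2.0, 50.0])
print(moderate)   # probability is spread over the three actions
print(extreme)    # almost all probability sits on the last action

action = boltzmann_explore([1.0, 2.0, 3.0], rng=random.Random(0))
print(action)
```

A higher temperature flattens the distribution and encourages exploration, while a lower one makes the choice close to greedy.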

Experience Buffer

This leverages the Experience Replay implementation in this article.

The purpose is to store the samples and results along the training process, and re-sample from the buffer to allow the agent to re-learn from the history.
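A bare-bones replay buffer can be written with collections.deque, as sketched below. This is a generic illustration and not the project's ExperienceBuffer: the class name, the fixed capacity, and the (state, action, reward) tuple layout are assumptions.

```python
import random
from collections import deque


class SimpleExperienceBuffer:
    """Fixed-size store of past samples with uniform random re-sampling."""

    def __init__(self, buffer_size, seed=None):
        self.buffer = deque(maxlen=buffer_size)   # the oldest samples drop out first
        self._rng = random.Random(seed)

    def add(self, sample):
        self.buffer.append(sample)

    def sample(self, batch_size):
        # Never ask for more samples than the buffer currently holds.
        batch_size = min(batch_size, len(self.buffer))
        return self._rng.sample(list(self.buffer), batch_size)


replay = SimpleExperienceBuffer(buffer_size=3, seed=0)
for step in range(5):
    # (state, action, reward) with a dummy state, as in the stateless bandit setting
    replay.add((0, step, float(step)))

batch = replay.sample(2)
print(list(replay.buffer))
print(batch)
```

Only the three most recent samples survive in this example, and each training step can draw a random batch from them instead of learning from the latest sample alone.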

Recorder

Last but not least, I created a Recorder class which can be used to keep track of the records stored in the data dictionary inside the Agent object. We can select the fields we would like it to store by specifying the key names in the RecorderDataField field in the config file:

RecorderDataField: [NETWORK_ACTION, ENGINE_REWARD]

[Listing: Recorder class]

2.4.3 ML Algorithms

With the components described above, we can tailor-make any class that takes these building blocks and creates a running procedure. This is the only part that needs to be customized for different purposes, but the logic is still fairly standardized for similar cases. For example, in this project I have created a ContextualBandit class which can run either an N-Armed bandit or a contextual bandit, depending on the number of states. To run it as an N-Armed bandit problem, we can simply specify a state space with a single fixed (dummy) state.

[Listing: ContextualBandit class in MAIN/Reinforcement.py]

- __init__: initiates the object and inherits the parent methods and properties. The TensorFlow machine learning attributes are defined here as well. After that, all the processors described above are instantiated by composition, taking the object itself as an input argument (agent).
- update_network: extracts the samples from the data dictionary and updates the TensorFlow layers and network.
- buffering: stores the sample in the ExperienceBuffer object if specified in the config.
- create_sample_list: creates samples for experience buffering.
- process: the main procedure that controls the flow of the training or testing. It takes a tf.Session() and performs the looping based on the values in the StepCounter objects initiated by the Agent.
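To show how a contextual bandit collapses to an N-Armed bandit when there is a single dummy state, here is a tabular sketch in plain Python. It is a deliberately simplified stand-in for the TensorFlow-based ContextualBandit: the run_bandit function, the try-everything-once-then-greedy selection rule, and the fixed toy rewards are assumptions and do not reproduce the project's network update or exploration.

```python
def run_bandit(states, n_actions, reward_fn, n_steps):
    """Tabular bandit: try every action once per state, then act greedily."""
    estimates = {s: [0.0] * n_actions for s in states}
    counts = {s: [0] * n_actions for s in states}
    for step in range(n_steps):
        state = states[step % len(states)]           # cycle through the states
        untried = [a for a in range(n_actions) if counts[state][a] == 0]
        if untried:
            action = untried[0]
        else:
            action = max(range(n_actions), key=lambda a: estimates[state][a])
        reward = reward_fn(state, action)
        counts[state][action] += 1
        # Incremental update of the mean reward of the chosen action.
        estimates[state][action] += (reward - estimates[state][action]) / counts[state][action]
    return estimates


# Toy rewards: one fixed (made-up) mean return per action, identical for any state.
toy_rewards = [0.1, 0.5, 0.3]

# N-Armed bandit: a state space with a single dummy state.
estimates = run_bandit(["dummy"], n_actions=3,
                       reward_fn=lambda s, a: toy_rewards[a], n_steps=10)
best_action = max(range(3), key=lambda a: estimates["dummy"][a])
print(estimates, best_action)
```

Passing more than one state to the same function gives the contextual case, with a separate row of estimates for each state.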

Disclaimer

This article and the relevant code and content are purely informative, and none of the information provided constitutes any recommendation regarding any security, transaction or investment strategy for any specific person.

The implementation described in this article could be risky and the market condition could be volatile and differ from the period covered above.

All trading strategies and tools are implemented at the users’ own risk.

