Deep Reinforcement Learning for Trading
We adopt Deep Reinforcement Learning algorithms to design trading strategies for continuous futures contracts. Both discrete and continuous action spaces are considered and volatility scaling is incorporated to create reward functions which scale trade positions based on market volatility. We test our algorithms on the 50 most liquid futures contracts from 2011 to 2019, and investigate how performance varies across different asset classes including commodities, equity indices, fixed income and FX markets. We compare our algorithms against classical time series momentum strategies, and show that our method outperforms such baseline models, delivering positive profits despite heavy transaction costs. The experiments show that the proposed algorithms can follow large market trends without changing positions and can also scale down, or hold, through consolidation periods.
Zihao Zhang, Stefan Zohren, and Stephen Roberts
Department of Engineering Science, Oxford-Man Institute of Quantitative Finance, University of Oxford
1 Introduction
Financial trading has been a widely researched topic and a variety of methods have been proposed to trade markets over the last few decades. These include fundamental analysis, technical analysis and algorithmic trading. Indeed, many practitioners use a hybrid of these techniques to make trades. Algorithmic trading has arguably gained the most interest recently and accounts for about 75% of trading volume on United States stock exchanges. The advantages of algorithmic trading are widespread, ranging from strong computing foundations to faster execution and risk diversification. One key component of such trading systems is a predictive signal that can lead to alpha (excess return) and, to this end, mathematical and statistical methods are widely applied. However, due to the low signal-to-noise ratio of financial data and the dynamic nature of markets, the design of these methods is non-trivial and the effectiveness of commonly derived signals varies through time.
In recent years, machine learning algorithms have gained much popularity in many areas, with notable successes in diverse application domains including image classification and natural language processing. Similar techniques have also been applied to financial markets in an attempt to find higher alpha (see [54, 55, 56, 47, 44] for a few examples using such techniques in the context of high-frequency data). Most research focuses on regression and classification pipelines in which excess returns, or market movements, are predicted over some (fixed) horizons. However, there is little discussion related to transforming these predictive signals into actual trade positions, although some work attempts to train a deep learning model to learn positions directly. Indeed, such a mapping is non-trivial. As an example, predictive horizons are often relatively short (one day or a few days ahead if using daily data), whereas large trends can persist for weeks or months with some moderate consolidation periods. We therefore need not only a signal with good predictive power but also a signal that can consistently produce good directional calls.
In this work, we report on Reinforcement Learning (RL) algorithms to tackle the above problems. Instead of making predictions and then deciding on a trade based on them, we train our models to directly output trade positions, bypassing the explicit forecasting step. Modern portfolio theory [1, 39, 21] implies that, given a finite time horizon, an investor chooses actions to maximise some expected utility of final wealth:
$$\mathbb{E}[U(W_T)] = \mathbb{E}\left[U\left(W_0 + \sum_{t=1}^{T} \delta W_t\right)\right] \tag{1}$$
where $U$ is the utility function, $W_T$ is the final wealth over a finite horizon $T$ and $\delta W_t$ represents the change in wealth. The performance of the final wealth measure depends upon sequences of interdependent actions where optimal trading decisions do not just decide immediate trade returns but also affect subsequent future returns. As mentioned in [30, 40], this falls under the framework of optimal control theory and forms a classical sequential decision making process. If the investor is risk-neutral, the utility function becomes linear and we only need to maximise the expected cumulative trade returns. The problem then fits exactly with the framework of RL, the goal of which is to maximise some expected cumulative rewards via an agent interacting with an uncertain environment. Under the RL framework, we can directly map different market situations to trade positions and conveniently include market frictions, such as commissions, in our reward functions, allowing trading performance to be directly optimised.
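To make the risk-neutral case concrete, the short derivation below assumes a linear utility $U(W) = a + bW$ with $b > 0$; the constants $a$ and $b$ are introduced here purely for illustration. It writes the final wealth $W_T$ as the initial wealth $W_0$ plus the sum of per-step wealth changes $\delta W_t$:

```latex
\mathbb{E}[U(W_T)]
  = \mathbb{E}\Big[a + b\Big(W_0 + \sum_{t=1}^{T} \delta W_t\Big)\Big]
  = a + b\,W_0 + b\,\mathbb{E}\Big[\sum_{t=1}^{T} \delta W_t\Big]
```

Since $a$, $b$ and $W_0$ are fixed, maximising the expected utility amounts to maximising the expected sum of wealth changes, which has the same form as an undiscounted cumulative reward.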
In this work, we apply state-of-the-art RL algorithms to the aforementioned problem setting, including Deep Q-learning Networks (DQN) [34, 49], Policy Gradients (PG) and Advantage Actor-Critic (A2C). Both discrete and continuous action spaces are explored in our work, and we improve reward functions with volatility scaling [37, 17] to scale up trade positions while volatility is low, and vice versa. We show the robustness and consistency of our algorithms by testing on the 50 most liquid futures contracts between 2011 and 2019. Our dataset consists of different asset classes including commodities, equity indices, fixed income and FX markets. We compare our method with classical time series momentum strategies; our method outperforms the baseline models and generates positive returns in all sectors despite heavy transaction costs. Time series momentum strategies work well in trending markets, like fixed income markets, but suffer losses in FX markets where directional moves are less usual. Our experiments show that our algorithms can monetise large moves without changing positions and also deal with markets that are more “mean-reverting”.
The remainder of the paper is structured as follows. Section 2 introduces the current literature and Section 3 presents our methodology. In Section 4, we compare our method with baseline algorithms and show how trading performance varies among different asset classes. A description of the dataset and training methodology is also included in this section. Section 5 summarises our findings and discusses extensions of our work.
2 Literature Review
In this section, we review some classical trading strategies and discuss how RL has been applied to this field. Fundamental analysis aims to measure the intrinsic value of a security by examining economic data so investors can compare the security’s current price with estimates to see if the security is undervalued or overvalued. One of the established strategies is called CAN-SLIM and it is based on a major study of market winners from 1880 to 2009. However, a common criticism of fundamental analysis is that the timing of entering and exiting trades is not specified. Even if markets move towards the estimated price, badly timed entries can lead to huge drawdowns, and such moves in account values are often unbearable for investors, shaking them out of the markets. Technical analysis, in contrast to fundamental analysis, uses a security’s historical price data to study price patterns. Technicians place trades based on a combination of indicators such as the Relative Strength Index (RSI) and Bollinger Bands. However, due to the lack of analysis of economic or market conditions, the predictability of these signals is not strong, often leading to false breakouts.
Algorithmic trading is a more systematic approach that involves mathematical modelling and automated execution. Examples include trend-following, mean-reversion, statistical arbitrage and delta-neutral trading strategies. We mainly review time series momentum strategies, as we benchmark our models against these algorithms. The original work developed a very robust trading strategy by simply taking the sign of returns over the last year as a signal and demonstrated profitability for every contract considered within 58 liquid instruments over 25 years. Since then, a number of methods [2, 4, 41, 27] have been proposed to enhance this strategy by estimating the trend and mapping it to actual trade positions. However, these strategies are designed to profit from large directional moves and can suffer huge losses if markets move sideways, as the predictability of these signals deteriorates and excess turnover erodes profitability. In our work, we adopt time series momentum features along with technical indicators to represent the state space and obtain trade positions directly using RL. The idea of our representation is simple: extract information across different momentum features and output positions based on the aggregated information.
The current literature on RL in trading can be categorised into three main methods: critic-only, actor-only and actor-critic approaches. The critic-only approach, mainly DQN, is the most published method in this field [7, 22, 46, 20, 40], where a state-action value function, $Q$, is constructed to represent how good a particular action is in a state. Discrete action spaces are adopted in these works and an agent is trained to go fully long or fully short. However, a fully invested position is risky during high volatility periods, exposing one to severe risk when opposite moves occur. Ideally, one would like to scale positions up or down according to current market conditions. Doing this requires large action spaces; however, the critic-only approach suffers from large action spaces as we need to assign a score to each possible action.
The second most common approach is the actor-only approach [36, 35, 27, 12], where the agent directly optimises the objective function without computing the expected outcomes of each action in a state. Because a policy is directly learned, actor-only approaches can be generalised to continuous action spaces. In the work of [35, 27], offline batch gradient ascent methods can be used to optimise the objective function, such as profits or the Sharpe ratio, as it is differentiable end to end. However, this is different from standard RL actor-only approaches where a distribution needs to be learned for the policy. In order to study the distribution of a policy, the Policy Gradient Theorem and Monte Carlo methods are adopted in training, and models are not updated until the end of each episode. We often experience slow learning and need many samples to obtain an optimal policy, as individual bad actions will be considered “good” as long as the total rewards are good, so it takes a long time to adjust these actions.
The actor-critic approach forms the third category and aims to solve the above learning problems by updating the policy in real time. The key idea is to alternately update two models: the actor, which controls how an agent performs given the current state, and the critic, which measures how good the chosen action is. However, this approach is the least studied method in financial applications, with only a few works [26, 5, 53]. We aim to supplement the literature and study the effectiveness of this approach in trading. For a more detailed discussion on state spaces, action spaces and reward functions, interested readers are pointed to the survey literature. Other important financial applications such as portfolio optimisation and trade execution are also covered there.
3 Methodology
In this section, we introduce our setup, including state spaces, action spaces and reward functions. We describe the three RL algorithms used in our work: Deep Q-learning Networks (DQN), Policy Gradients (PG) and Advantage Actor-Critic (A2C).
3.1 Markov Decision Process Formalisation
We can formulate the trading problem as a Markov Decision Process (MDP) where an agent interacts with the environment at discrete time steps. At each time step $t$, the agent receives some representation of the environment denoted as a state $S_t$. Given this state, an agent chooses an action $A_t$, and based on this action, a numerical reward $R_{t+1}$ is given to the agent at the next time step, and the agent finds itself in a new state $S_{t+1}$. The interaction between the agent and the environment produces a trajectory $[S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots]$. At any time step $t$, the goal of RL is to maximise the expected return (essentially the expected discounted cumulative rewards) denoted as $G_t$ at time $t$:
$$G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k \tag{2}$$
where $\gamma$ is the discounting factor. If the utility function in Equation 1 has a linear form and we use $R_t$ to represent trade returns, we can see that optimising $\mathbb{E}(G)$ is equivalent to optimising our expected wealth.
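The discounted return can be computed for every time step of a finished episode with a single backward pass. The following is a minimal sketch, assuming rewards are stored in a plain list where `rewards[t]` holds the reward received after the action at step `t`; the function name is hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}], where rewards[t] is the reward R_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    print(discounted_returns([1.0, 0.0, -2.0, 3.0], gamma=0.5))
    # prints [0.875, -0.25, -0.5, 3.0]
```

With `gamma=1.0` the first entry is simply the sum of all rewards, which corresponds to the undiscounted cumulative trade return.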
In the literature, many different features have been used to represent state spaces. Among these features, a security’s past price is always included and the related technical indicators are often used. In our work, we take past price, returns ($r_t$) over different horizons and technical indicators including Moving Average Convergence Divergence (MACD) and the Relative Strength Index (RSI) to represent states. At a given time step, we take the past 60 observations of each feature to form a single state. A list of our features is below:
- Normalised close price series,
- Returns over the past month, 2-month, 3-month and 1-year periods. Following prior work, we normalise them by daily volatility adjusted to a reasonable time scale. As an example, we normalise annual returns as $r_{t-252,t}/(\sigma_t\sqrt{252})$ where $\sigma_t$ is computed using an exponentially weighted moving standard deviation of $r_t$ with a 60-day span,
- MACD indicators, as proposed in prior work, where:
$$\text{MACD}_t = \frac{q_t}{\text{std}(q_{t-252:t})}, \qquad q_t = \frac{m(S) - m(L)}{\text{std}(p_{t-63:t})} \tag{3}$$
where $\text{std}(p_{t-63:t})$ is the 63-day rolling standard deviation of prices $p_t$ and $m(S)$ is the exponentially weighted moving average of prices with a time scale $S$,
- The RSI is an oscillating indicator moving between 0 and 100. It indicates the oversold (a reading below 20) or overbought (above 80) conditions of an asset by measuring the magnitude of recent price changes. We include this indicator with a look-back window of 30 days in our state representations.
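The features listed above can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions that are not stated in the text: the exponential weighting uses `alpha = 2 / (span + 1)`, the exponentially weighted standard deviation is the simple biased form, returns are price differences, the RSI uses simple averages of gains and losses, and the spans passed to `macd_q` are arbitrary example values. All function names are hypothetical, and only the inner MACD quantity (short minus long moving average, divided by the 63-day price standard deviation) is computed.

```python
import math
import statistics


def ewm_mean(values, span):
    """Exponentially weighted moving average with alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1.0)
    out = [float(values[0])]
    for v in values[1:]:
        out.append(alpha * v + (1.0 - alpha) * out[-1])
    return out


def ewm_std(values, span):
    """Biased exponentially weighted standard deviation: sqrt(E[x^2] - E[x]^2)."""
    mean = ewm_mean(values, span)
    mean_sq = ewm_mean([v * v for v in values], span)
    return [math.sqrt(max(m2 - m * m, 0.0)) for m, m2 in zip(mean, mean_sq)]


def normalised_return(prices, horizon=252, vol_span=60):
    """Return over `horizon` days divided by sigma_t * sqrt(horizon), at the last step."""
    daily = [b - a for a, b in zip(prices[:-1], prices[1:])]
    sigma_t = ewm_std(daily, vol_span)[-1]
    total = prices[-1] - prices[-1 - horizon]
    return total / (sigma_t * math.sqrt(horizon))


def macd_q(prices, short_span, long_span, std_window=63):
    """(short EWMA - long EWMA) / rolling std of prices, at the last step."""
    fast = ewm_mean(prices, short_span)[-1]
    slow = ewm_mean(prices, long_span)[-1]
    return (fast - slow) / statistics.stdev(prices[-std_window:])


def rsi(prices, window=30):
    """RSI in [0, 100] from the last `window` price changes (simple averages)."""
    changes = [b - a for a, b in zip(prices[-window - 1:-1], prices[-window:])]
    gains = sum(c for c in changes if c > 0) / window
    losses = sum(-c for c in changes if c < 0) / window
    if gains == 0 and losses == 0:
        return 50.0  # flat prices: neither overbought nor oversold
    if losses == 0:
        return 100.0
    return 100.0 - 100.0 / (1.0 + gains / losses)


if __name__ == "__main__":
    # Synthetic upward-trending price series, used only to exercise the functions.
    prices = [100.0 + 0.5 * t + 2.0 * math.sin(t / 5.0) for t in range(300)]
    print(normalised_return(prices), macd_q(prices, 8, 24), rsi(prices))
```

A state at a given time step would then stack the past 60 values of each such feature.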
We study both discrete and continuous action spaces. For discrete action spaces, a simple action set of $\{-1, 0, 1\}$ is used and each value represents the position directly, i.e. $-1$ corresponds to a maximally short position, $0$ to no holdings and $1$ to a maximally long position. This representation of action space is also known as target orders, where a trade position is the output instead of the trading decision. Note that if the current action and next action are the same, no transaction cost will occur and we just maintain our positions. If we move from a fully long position to a fully short position, the transaction cost will be doubled. The design of continuous action spaces is very similar to the discrete case: we still output trade positions but allow actions to be any value between -1 and 1 ($A_t \in [-1, 1]$).
The design of the reward function depends on the utility function in Equation 1. In this work, we let the utility function be profits, representing a risk-insensitive trader, and the reward $R_t$ at time $t$ is:
$$R_t = \mu\left[A_{t-1}\frac{\sigma_{tgt}}{\sigma_{t-1}}\,r_t - \text{bp}\;p_{t-1}\left|\frac{\sigma_{tgt}}{\sigma_{t-1}}A_{t-1} - \frac{\sigma_{tgt}}{\sigma_{t-2}}A_{t-2}\right|\right] \tag{4}$$
where $\sigma_{tgt}$ is the volatility target and $\sigma_{t-1}$ is an ex-ante volatility estimate calculated using an exponentially weighted moving standard deviation with a 60-day window on $r_t$. This expression forms the volatility scaling in [37, 17, 27], where our position is scaled up when market volatility is low and scaled down when it is high. In addition, given a volatility target, our reward is mostly driven by our actions instead of being heavily affected by market volatility. We can also consider the volatility scaling as normalising rewards from different contracts to the same scale. Since our data consists of 50 futures contracts with different price ranges, we need to normalise different rewards to the same scale for training and also for portfolio construction. $\mu$ is a fixed number per contract at each trade and we set it to 1. Note that our transaction cost term also includes a price term $p_{t-1}$. This is necessary because we work with many contracts and each contract has different costs, so we represent transaction cost as a fraction of traded value. The constant bp (basis point) is the cost rate and 1bp = 0.0001. As an example, if the cost rate is 1bp, we need to pay $0.1 to buy one unit of a contract priced at $1000.
We define $r_t = p_t - p_{t-1}$ and this expression represents additive profits. Additive profits are often used if a fixed number of shares or contracts is traded at each time. If we want to trade a fraction of our accumulated wealth at each time, multiplicative profits should be used and $r_t = p_t/p_{t-1} - 1$. In this case, we also need to change the expression of $\mu$ as $\mu$ represents the percentage of our wealth. The exact form can be found in the literature. We stick to additive profits in our work because a logarithmic transformation needs to be taken for multiplicative profits to obtain the cumulative rewards required by the RL setup, but a logarithmic transformation penalises large wealth growth.
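The reward described above combines a volatility-scaled position with a proportional transaction cost. The sketch below assumes the reward is the scaled previous position times the price change, minus bp times the previous price times the absolute change in scaled position; the function and argument names are hypothetical, and additive profits are assumed.

```python
def volatility_scaled_reward(a_prev, a_prev2, sigma_prev, sigma_prev2,
                             r_t, p_prev, sigma_tgt, bp, mu=1.0):
    """Reward for one step with volatility scaling and proportional costs.

    a_prev, a_prev2         : positions chosen at t-1 and t-2, each in [-1, 1]
    sigma_prev, sigma_prev2 : ex-ante volatility estimates at t-1 and t-2
    r_t                     : additive return p_t - p_{t-1}
    p_prev                  : price at t-1
    sigma_tgt               : volatility target
    bp                      : cost rate as a fraction (1bp = 0.0001)
    mu                      : number of contracts traded
    """
    scaled_prev = sigma_tgt / sigma_prev * a_prev
    scaled_prev2 = sigma_tgt / sigma_prev2 * a_prev2
    pnl = scaled_prev * r_t
    cost = bp * p_prev * abs(scaled_prev - scaled_prev2)
    return mu * (pnl - cost)


if __name__ == "__main__":
    # Going from flat to fully long at a price of 1000 with a 1bp cost rate
    # (volatility equal to the target, no price change): cost of about 0.1.
    print(volatility_scaled_reward(1.0, 0.0, 0.2, 0.2, 0.0, 1000.0, 0.2, 0.0001))
    # Flipping from fully long to fully short doubles the cost.
    print(volatility_scaled_reward(-1.0, 1.0, 0.2, 0.2, 0.0, 1000.0, 0.2, 0.0001))
```

Holding the same scaled position from one step to the next makes the cost term vanish, which is why a larger bp discourages turnover.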
3.2 RL Algorithms
Deep Q-learning Networks (DQN). A DQN uses a neural network to approximate the state-action value function (Q function), which estimates how good it is for the agent to perform a given action in a given state. Suppose our Q function is parameterised by some $\theta$. We minimise the mean squared error between the current and target Q to derive the optimal state-action value function:
$$L(\theta) = \mathbb{E}\left[\left(Q_\theta(S_t, A_t) - Q'_\theta(S_t, A_t)\right)^2\right], \qquad Q'_\theta(S_t, A_t) = R_{t+1} + \gamma \max_{A'} Q_\theta(S_{t+1}, A') \tag{5}$$
where $L(\theta)$ is the objective function. A problem is that the training of a vanilla DQN is not stable and suffers from variability. Many improvements have been made to stabilise the training process, and we adopt the following three strategies to improve the training in our work: fixed Q-targets, Double DQN and Dueling DQN. Fixed Q-targets and Double DQN are used to reduce policy variances and to solve the problem of “chasing tails” by using a separate network to produce target values. Dueling DQNs separate the Q-value into the state value and the advantage of each action. The benefit of doing this is that the value stream gets more updates and we learn a better representation of the state values.
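The stabilisation ideas above can be illustrated without a neural network library. The sketch below assumes Q-values are given as plain lists indexed by action; the function names are hypothetical, and the mean-subtracted aggregation is the commonly used dueling form, which may differ from the implementation used in this work.

```python
def double_dqn_target(reward, gamma, q_online_next, q_target_next, done=False):
    """Target value: the online network picks the action, the target network scores it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


def dueling_q_values(state_value, advantages):
    """Combine a state value and per-action advantages into Q-values."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


if __name__ == "__main__":
    # Three actions, e.g. short, flat and long.
    print(double_dqn_target(1.0, 0.5, [0.1, 0.9, 0.3], [2.0, 4.0, 8.0]))  # prints 3.0
    print(dueling_q_values(1.0, [1.0, 2.0, 3.0]))  # prints [0.0, 1.0, 2.0]
```

In the first call the online network prefers the second action, so the target uses the target network's value for that action (4.0) rather than its own maximum (8.0), which is the mechanism that reduces over-estimation.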
Policy Gradients (PG). The PG aims to maximise the expected cumulative rewards by optimising the policy directly. If we use a neural network with parameters $\theta$ to represent the policy, $\pi_\theta(A|S)$, we can generate a trajectory from the environment and obtain a sequence of rewards. We maximise the expected cumulative rewards $J(\theta)$ by using gradient ascent to adjust $\theta$:
$$J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T-1} R_{t+1} \,\Big|\, \pi_\theta\right], \qquad \nabla_\theta J(\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(A_t|S_t)\, G_t \tag{6}$$
where $G_t$ is defined in Equation 2. Compared with DQN, PG learns a policy directly and can output a probability distribution over actions. This is useful if we want to design stochastic policies or work with continuous action spaces. However, the training of PG uses Monte Carlo methods to sample trajectories from the environment and the update is done only when the episode finishes. This often results in slow training, and it can get stuck at a (sub-optimal) local maximum.
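For a discrete softmax policy, the score-function gradient used by PG has a simple closed form with respect to the action logits: the one-hot vector of the chosen action minus the action probabilities, weighted by the return. The sketch below shows this for a finished episode; it assumes the logits per step are already available as lists, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_logit_gradients(logits_per_step, actions, returns):
    """Gradient of sum_t log pi(A_t | S_t) * G_t with respect to each step's logits."""
    grads = []
    for logits, action, g in zip(logits_per_step, actions, returns):
        probs = softmax(logits)
        grads.append([((1.0 if i == action else 0.0) - p) * g
                      for i, p in enumerate(probs)])
    return grads


if __name__ == "__main__":
    # One step, uniform policy over three actions, chosen action 2, return 3.0.
    print(reinforce_logit_gradients([[0.0, 0.0, 0.0]], [2], [3.0]))
```

A positive return pushes up the logit of the chosen action and pushes down the others. Because every step in the episode is weighted by its own return only after the episode ends, an individual poor action inside a good episode is still reinforced, which is the slow-learning issue noted above.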
Advantage Actor-Critic (A2C). The A2C is proposed to solve the training problem of PG by updating the policy in real time. It consists of two models: one is an actor network that outputs the policy and the other is a critic network that measures how good the chosen action is in the given state. We can update the policy network by maximising the objective function:
$$J(\theta) = \mathbb{E}\left[\log \pi_\theta(A_t|S_t)\, A_{adv}(S_t, A_t)\right] \tag{7}$$
where $A_{adv}(S_t, A_t)$ is the advantage function defined as:
$$A_{adv}(S_t, A_t) = R_{t+1} + \gamma V(S_{t+1}|w) - V(S_t|w) \tag{8}$$
In order to calculate advantages, we use another network, the critic network, with parameters $w$ to model the state value function $V(S|w)$, and we can update the critic network using gradient descent to minimise the TD-error:
$$J(w) = \left(R_{t+1} + \gamma V(S_{t+1}|w) - V(S_t|w)\right)^2 \tag{9}$$
The A2C is most useful if we are interested in continuous action spaces, as we reduce the policy variance by using the advantage function and update the policy in real time. The training of A2C can be done synchronously or asynchronously (A3C). In this work, we adopt the synchronous approach and execute agents in parallel on multiple environments.
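The advantage and the critic loss share the same one-step temporal-difference quantity. A minimal sketch, assuming the critic's value estimates for the current and next state are already available as numbers (the function names and the `done` flag are hypothetical):

```python
def td_advantage(reward, value_s, value_next, gamma, done=False):
    """One-step advantage: reward + gamma * V(next state) - V(current state)."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


def critic_loss(reward, value_s, value_next, gamma, done=False):
    """Squared TD error that the critic minimises by gradient descent."""
    return td_advantage(reward, value_s, value_next, gamma, done) ** 2


if __name__ == "__main__":
    print(td_advantage(1.0, 2.0, 4.0, 0.5))  # prints 1.0
    print(critic_loss(1.0, 2.0, 4.0, 0.5))   # prints 1.0
```

Because the advantage is available after every step, the actor can be updated without waiting for the episode to finish, in contrast to the Monte Carlo update used by PG.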
4 Experiments
4.1 Description of Dataset
We use data on 50 ratio-adjusted continuous futures contracts from the Pinnacle Data Corp CLC Database. Our dataset ranges from 2005 to 2019, and consists of a variety of asset classes including commodity, equity index, fixed income and FX. A full breakdown of the dataset can be found in Appendix A. We retrain our model every 5 years, using all data available up to that point to optimise the parameters. Model parameters are then fixed for the next 5 years to produce out-of-sample results. In total, our testing period is from 2011 to 2019.
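The retraining scheme can be pictured as an expanding-window split. The sketch below is one reading of the description, with the exact split years being an assumption: the text only states that data start in 2005, models are retrained every 5 years on all data so far, and testing covers 2011 to 2019. The function name is hypothetical.

```python
def expanding_window_splits(first_year, first_test_year, last_year, step):
    """Yearly train/test ranges where each model trains on all data before its test block."""
    splits = []
    test_start = first_test_year
    while test_start <= last_year:
        test_end = min(test_start + step - 1, last_year)
        splits.append({
            "train": (first_year, test_start - 1),
            "test": (test_start, test_end),
        })
        test_start += step
    return splits


if __name__ == "__main__":
    for split in expanding_window_splits(2005, 2011, 2019, 5):
        print(split)
    # prints {'train': (2005, 2010), 'test': (2011, 2015)}
    # prints {'train': (2005, 2015), 'test': (2016, 2019)}
```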
4.2 Baseline Algorithms
We compare our methods to baseline models including a long-only strategy and classical time series momentum strategies.
We compare these baseline models with our RL algorithms, DQN, PG and A2C. DQN and PG have discrete action spaces $\{-1, 0, 1\}$, and A2C has a continuous action space $[-1, 1]$.
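The time series momentum rule reviewed in Section 2 takes the sign of the return over the last year as the trading signal. A minimal sketch of that rule, assuming daily prices, 252 trading days per year and additive returns (all three are assumptions for illustration, and the function name is hypothetical):

```python
def sign_momentum_position(prices, lookback=252):
    """Position in {-1, 0, 1}: the sign of the price change over `lookback` steps."""
    past_return = prices[-1] - prices[-1 - lookback]
    if past_return > 0:
        return 1
    if past_return < 0:
        return -1
    return 0


if __name__ == "__main__":
    rising = [100.0 + 0.1 * t for t in range(300)]
    falling = [100.0 - 0.1 * t for t in range(300)]
    print(sign_momentum_position(rising), sign_momentum_position(falling))  # prints 1 -1
```

The output uses the same position convention as the discrete action set, which makes the baseline directly comparable with the DQN and PG agents.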
4.3 Training Schemes for RL
In our work, we use Long Short-Term Memory (LSTM) neural networks to model both actor and critic networks. LSTMs are traditionally used in natural language processing, but many recent works have applied them to financial time series [14, 48, 3, 13]; in particular, prior work shows that LSTMs deliver superior performance on modelling daily financial data. We use two-layer LSTM networks with 64 and 32 units in all models, and Leaky Rectified Linear Units (Leaky-ReLU) are used as activation functions. As our dataset consists of different asset classes, we train a separate model for each asset class. It is common practice to train models by grouping contracts within the same asset class, and we find it also improves our performance.
We list the values of hyperparameters for different RL algorithms in Table 1. We denote the learning rates for critic and actor networks as $\alpha_{critic}$ and $\alpha_{actor}$. The Adam optimiser is used for training all networks, and batch size means the size of the mini-batch used in gradient descent. As previously introduced, $\gamma$ is the discounting factor and bp is the cost rate used in training. We can treat bp as a regularising term, as a large bp penalises turnover and lets agents maintain current trade positions. The memory size shows the size of the buffer for experience replay, and we update the parameters of our target network in DQN every $\tau$ steps.
Table 1: Values of hyperparameters for different RL algorithms (columns: Model, Optimiser, Batch size, bp, Memory size).
4.4 Experimental Results
We test both baseline models and our methods between 2011 and 2019, and we calculate the trade returns net of transaction costs as in Equation 4 for each contract. We then form a simple portfolio by giving equal weights to each contract, and the trade return of a portfolio is:
$$R_t^{port} = \frac{1}{N}\sum_{i=1}^{N} R_t^i \tag{10}$$
where $N$ represents the number of contracts considered and $R_t^i$ is the trade return for contract $i$ at time $t$. We evaluate the performance of this portfolio using the following metrics:
- $E(R)$: annualised expected trade return,
- $\text{std}(R)$: annualised standard deviation of trade return,
- Downside Deviation (DD): annualised standard deviation of trade returns that are negative, also known as downside risk,
- Sharpe: annualised Sharpe Ratio ($E(R)/\text{std}(R)$),
- Sortino: a variant of the Sharpe Ratio that uses downside deviation as the risk measure ($E(R)/\text{DD}$),
- MDD: maximum drawdown shows the maximum observed loss from any peak of a portfolio,
- Calmar: the Calmar ratio compares the expected annual rate of return with maximum drawdown. In general, the higher the ratio is, the better the performance of the portfolio is,
- % +ve Returns: percentage of positive trade returns,
- Ave. P / Ave. L: the ratio between positive and negative trade returns.
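The portfolio return and the metrics listed above can be sketched with the Python standard library. The sketch assumes daily returns with 252 periods per year, additive cumulative returns for the drawdown, and the ratio of average profit to average loss for the last metric; these conventions and all function names are assumptions for illustration.

```python
import math
import statistics


def portfolio_returns(returns_by_contract):
    """Equal-weight average of per-contract returns at each time step."""
    n = len(returns_by_contract)
    return [sum(step) / n for step in zip(*returns_by_contract)]


def max_drawdown(returns):
    """Largest peak-to-trough fall of the cumulative (additive) return."""
    cumulative, peak, worst = 0.0, 0.0, 0.0
    for r in returns:
        cumulative += r
        peak = max(peak, cumulative)
        worst = max(worst, peak - cumulative)
    return worst


def performance_metrics(returns, periods_per_year=252):
    """Metrics for a return series; needs at least two negative and one positive return."""
    positives = [r for r in returns if r > 0]
    negatives = [r for r in returns if r < 0]
    scale = math.sqrt(periods_per_year)
    ann_return = statistics.mean(returns) * periods_per_year
    ann_vol = statistics.stdev(returns) * scale
    downside = statistics.stdev(negatives) * scale
    mdd = max_drawdown(returns)
    return {
        "E(R)": ann_return,
        "std(R)": ann_vol,
        "DD": downside,
        "Sharpe": ann_return / ann_vol,
        "Sortino": ann_return / downside,
        "MDD": mdd,
        "Calmar": ann_return / mdd,
        "% +ve Returns": len(positives) / len(returns),
        "Ave. P / Ave. L": statistics.mean(positives) / abs(statistics.mean(negatives)),
    }


if __name__ == "__main__":
    contract_a = [0.01, -0.02, 0.03, -0.01]
    contract_b = [0.03, -0.04, 0.01, -0.01]
    portfolio = portfolio_returns([contract_a, contract_b])
    for name, value in performance_metrics(portfolio).items():
        print(name, round(value, 4))
```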
We present our results in Table 2, where an additional layer of portfolio-level volatility scaling is applied for each model. This brings the volatility of different methods to the same target so we can directly compare metrics such as expected and cumulative trade returns. We also include the results without this volatility scaling for reference in Table 3 in Appendix B. Table 2 is split into five parts based on different asset classes. The results show the performance of a portfolio using only contracts from that specific asset class. An exception is the last part of the table, where we form a portfolio using all contracts from our dataset. The cumulative trade returns for different models and asset classes are presented in Figure 1.
We can see that RL algorithms deliver better performance for most asset classes except for the equity index, where a long-only strategy is better. This can be explained by the fact that most equity indices were dominated by large upward trends in the testing period. Similarly, fixed income markets had a rising trend until 2016 and have been in consolidation since then. Arguably, the most reasonable strategy in these cases might be just to hold our positions if large trends persist. However, the long-only strategy performs the worst in commodity and FX markets, where prices are more volatile. RL algorithms perform better in these markets as they are able to go long or short at reasonable time points. Overall, DQN obtains the best performance among all models and the second best is the A2C approach. We investigate the cause of this observation and find that A2C generates larger turnover, leading to smaller average returns per turnover as shown in Figure 3.
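Table 2: Performance metrics for each asset class and for the portfolio of all contracts, with portfolio-level volatility scaling applied.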
We also investigate the performance of our methods under different transaction costs. In the left panel of Figure 2, we plot the annualised Sharpe Ratio for the portfolio using all contracts at different cost rates. We can see that RL algorithms can tolerate larger cost rates and, in particular, DQN and A2C can still generate positive profits with a cost rate of 25bp. To understand how cost rates (bp) translate into monetary values, we plot the average cost per contract in the right panel of Figure 2, and we can see that 25bp represents roughly $3.5 per contract. This is a realistic cost for a retail trader to pay, but institutional traders have a different fee structure based on trading volumes and often have lower costs. In any case, this shows the validity of our methods in a realistic setup.
The performance of a portfolio is generally better than the performance of an individual contract as risks are diversified across a range of assets, so the return per unit of risk is higher. In order to assess the raw quality of our methods, we investigate the performance of individual contracts. We use the boxplots in Figure 3 to present the annualised Sharpe Ratio and average trade return per turnover for each futures contract. Overall, these results reinforce our previous findings that RL algorithms generally work better, and the performance of our method is not driven by a single contract that shows superior performance, confirming the consistency of our model.
5 Conclusion
We adopt RL algorithms to learn trading strategies for continuous futures contracts. We discuss the connection between modern portfolio theory and the RL reward hypothesis, and show that they are equivalent if a linear utility function is used. Our analysis focuses on three RL algorithms, namely Deep Q-learning Networks, Policy Gradients and Advantage Actor-Critic, and investigates both discrete and continuous action spaces. We utilise features from time series momentum and technical indicators to form state representations. In addition, volatility scaling is introduced to improve reward functions. We test our methods on 50 liquid futures contracts from 2011 to 2019, and our results show that RL algorithms outperform baseline models and deliver profits even under heavy transaction costs.
In future work, we would like to investigate different forms of utility functions. In practice, an investor is often risk-averse and the objective is to maximise a risk-adjusted performance function such as the Sharpe Ratio, leading to a concave utility function. We can resort to distributional reinforcement learning to obtain the entire distribution over Q-values instead of only the expected Q-value. Once the distribution is learned, we can choose actions with the highest expected Q-value and the lowest standard deviation of it, maximising the Sharpe Ratio. We can also extend our methods to portfolio optimisation by modifying the action spaces to give weights of individual contracts in a portfolio. We can combine the reward functions with mean-variance portfolio theory to deliver a reasonable expected trade return with minimal volatility.
Acknowledgements
The authors would like to thank Bryan Lim, Vu Nguyen, Anthony Ledford and members of the Machine Learning Research Group at the University of Oxford for their helpful comments. We are most grateful to the Oxford-Man Institute of Quantitative Finance, who provided the Pinnacle dataset and computing facilities.
-  Kenneth J Arrow. The theory of risk aversion. Essays in the theory of risk-bearing, pages 90–120, 1971.
-  Nick Baltas and Robert Kosowski. Demystifying time-series momentum strategies: Volatility estimators, trading rules and pairwise correlations. Trading Rules and Pairwise Correlations (May 8, 2017), 2017.
-  Wei Bao, Jun Yue, and Yulei Rao. A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PloS one, 12(7):e0180944, 2017.
-  Jamil Baz, Nicolas Granger, Campbell R Harvey, Nicolas Le Roux, and Sandy Rattray. Dissecting investment strategies in the cross section and time series. Available at SSRN 2695101, 2015.
-  Stelios D Bekiros. Heterogeneous trading strategies with adaptive fuzzy actor–critic reinforcement learning: A behavioral approach. Journal of Economic Dynamics and Control, 34(6):1153–1170, 2010.
-  Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 449–458. JMLR. org, 2017.
-  Francesco Bertoluzzo and Marco Corazza. Testing different reinforcement learning configurations for financial trading: Introduction and applications. Procedia Economics and Finance, 3:68–77, 2012.
-  Ernie Chan. Quantitative trading: How to build your own algorithmic trading business, volume 430. John Wiley & Sons, 2009.
-  Ernie Chan. Algorithmic trading: Winning strategies and their rationale, volume 625. John Wiley & Sons, 2013.
-  Pinnacle Data Corp. CLC Database. https://pinnacledata2.com/clc.html, Accessed: 2019-08-11.
-  Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of machine learning research, 12(Aug):2493–2537, 2011.
-  Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qionghai Dai. Deep direct reinforcement learning for financial signal representation and trading. IEEE transactions on neural networks and learning systems, 28(3):653–664, 2016.
-  Luca Di Persio and Oleksandr Honchar. Artificial neural networks architectures for stock price prediction: Comparisons and applications. International journal of circuits, systems and signal processing, 10:403–413, 2016.
-  Thomas Fischer and Christopher Krauss. Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research, 270(2):654–669, 2018.
-  Thomas G Fischer. Reinforcement learning in financial markets - A survey. Technical report, FAU Discussion Papers in Economics, 2018.
-  Benjamin Graham, David Le Fevre Dodd, Sidney Cottle, et al. Security analysis. McGraw-Hill New York, 1934.
-  Campbell R Harvey, Edward Hoyle, Russell Korgaonkar, Sandy Rattray, Matthew Sargaison, and Otto Van Hemert. The impact of volatility targeting. The Journal of Portfolio Management, 45(1):14–33, 2018.
-  Hado V Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  Chien Yi Huang. Financial trading as a game: A deep reinforcement learning approach. arXiv preprint arXiv:1807.02787, 2018.
-  Jonathan E Ingersoll. Theory of financial decision making, volume 3. Rowman & Littlefield, 1987.
-  Olivier Jin and Hamza El-Saawy. Portfolio management using reinforcement learning. Technical report, Working paper, Stanford University, 2016.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Proceedings of the International Conference on Learning Representations, 2015.
-  Donald E Kirk. Optimal control theory: An introduction. Courier Corporation, 2012.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
-  Hailin Li, Cihan H Dagli, and David Enke. Short-term stock market timing prediction under reinforcement learning schemes. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 233–240. IEEE, 2007.
-  Bryan Lim, Stefan Zohren, and Stephen Roberts. Enhancing time-series momentum strategies using deep neural networks. The Journal of Financial Data Science, 2019.
-  Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3, 2013.
-  Harry Markowitz. Portfolio selection. The journal of finance, 7(1):77–91, 1952.
-  Robert C Merton. Lifetime portfolio selection under uncertainty: The continuous-time case. The review of Economics and Statistics, pages 247–257, 1969.
-  Nicholas Metropolis and Stanislaw Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341, 1949.
-  Richard O Michaud. The Markowitz optimization enigma: Is ‘optimized’ optimal? Financial Analysts Journal, 45(1):31–42, 1989.
-  Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. NIPS Deep Learning Workshop, 2013.
-  John Moody and Matthew Saffell. Learning to trade via direct reinforcement. IEEE transactions on neural Networks, 12(4):875–889, 2001.
-  John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17(5-6):441–470, 1998.
-  Tobias J Moskowitz, Yao Hua Ooi, and Lasse Heje Pedersen. Time series momentum. Journal of Financial Economics, 104(2):228–250, 2012.
-  John J Murphy. Technical analysis of the financial markets: A comprehensive guide to trading methods and applications. Penguin, 1999.
-  John W Pratt. Risk aversion in the small and in the large. In Uncertainty in Economics, pages 59–79. Elsevier, 1978.
-  Gordon Ritter. Machine learning for trading. 2017.
-  Janick Rohrbach, Silvan Suremann, and Joerg Osterrieder. Momentum and trend following trading strategies for currencies revisited-combining academia and industry. Available at SSRN 2949379, 2017.
-  Jack D Schwager. A Complete Guide to the Futures Market: Technical Analysis, Trading Systems, Fundamental Analysis, Options, Spreads, and Trading Principles. John Wiley & Sons, 2017.
-  Surendra Singhvi. How to Make Money in Stocks: A Winning System in Good Times or Bad. Management Review, 77(9):61–63, 1988.
-  Justin Sirignano and Rama Cont. Universal features of price formation in financial markets: Perspectives from deep learning. Quantitative Finance, 19(9):1449–1459, 2019.
-  Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 2. MIT press Cambridge, 1998.
-  Zhiyong Tan, Chai Quek, and Philip YK Cheng. Stock trading with cycles: A financial application of ANFIS and reinforcement learning. Expert Systems with Applications, 38(5):4741–4755, 2011.
-  Avraam Tsantekidis, Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Forecasting stock prices from the limit order book using convolutional neural networks. In 2017 IEEE 19th Conference on Business Informatics (CBI), volume 1, pages 7–12. IEEE, 2017.
-  Avraam Tsantekidis, Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Using deep learning to detect price change indications in financial markets. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 2511–2515. IEEE, 2017.
-  Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI conference on artificial intelligence, 2016.
-  Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, pages 1995–2003. JMLR.org, 2016.
-  J Welles Wilder. New concepts in technical trading systems. Trend Research, 1978.
-  Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
-  Zhuoran Xiong, Xiao-Yang Liu, Shan Zhong, Anwar Walid, et al. Practical deep reinforcement learning approach for stock trading. arXiv preprint arXiv:1811.07522, 2018.
-  Zihao Zhang, Stefan Zohren, and Stephen Roberts. BDLOB: Bayesian deep convolutional neural networks for limit order books. Third Workshop on Bayesian Deep Learning (NeurIPS 2018), Montréal, Canada, 2018.
-  Zihao Zhang, Stefan Zohren, and Stephen Roberts. DeepLOB: Deep convolutional neural networks for limit order books. IEEE Transactions on Signal Processing, 67(11):3001–3012, 2019.
-  Zihao Zhang, Stefan Zohren, and Stephen Roberts. Extending deep learning models for limit order books to quantile regression. Proceedings of the Time Series Workshop of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019.
Our dataset consists of 50 futures contracts: 25 commodity contracts, 11 equity index contracts, 5 fixed income contracts and 9 forex contracts. A detailed description of each contract is given below:
.1 Commodities
|Ticker|Contract details|
|---|---|
|DA|MILK III, Comp.|
|GI|GOLDMAN SAKS C. I.|
|ZF|FEEDER CATTLE, Electronic|
|ZH|HEATING OIL, Electronic|
|ZL|SOYBEAN OIL, Electronic|
|ZN|NATURAL GAS, Electronic|
|ZR|ROUGH RICE, Electronic|
|ZT|LIVE CATTLE, Electronic|
|ZU|CRUDE OIL, Electronic|
|ZZ|LEAN HOGS, Electronic|
.2 Equity Indexes
|Ticker|Contract details|
|---|---|
|ER|RUSSELL 2000, MINI|
|ES|S & P 500, MINI|
|LX|FTSE 100 INDEX|
|MD|S&P 400 (Mini Electronic)|
|SC|S & P 500, Composite|
|SP|S & P 500, Day Session|
|XU|DOW JONES EUROSTOXX50|
|XX|DOW JONES STOXX 50|
|YM|Mini Dow Jones ($5.00)|
.3 Fixed Incomes
|Ticker|Contract details|
|---|---|
|DT|EURO BOND (BUND)|
|FB|T-NOTE, 5-year Composite|
|TY|T-NOTE, 10-year Composite|
.4 Forex
|Ticker|Contract details|
|---|---|
|AN|AUSTRALIAN, Day Session|
|BN|BRITISH POUND, Composite|
|DX|US DOLLAR INDEX|
|JN|JAPANESE YEN, Composite|
|SN|SWISS FRANC, Composite|
Table 3 presents the performance metrics for portfolios without an additional layer of volatility scaling.
[Table 3: only the column header "% of + Ret" survives; the remaining columns and rows are missing.]
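The additional layer of volatility scaling referred to above is not specified in this section. The following is only a minimal sketch of one common way to apply such a layer to a series of daily portfolio returns: estimate ex-ante volatility with an exponentially weighted average of squared returns, then lever the next day's return towards an annualised target. The function names `ewm_volatility` and `volatility_scale`, the 60-day span, the 10% target, the 252-day year and the volatility floor are all illustrative assumptions, not values taken from the text.

```python
import numpy as np


def ewm_volatility(returns, span=60):
    """Exponentially weighted daily volatility, assuming zero-mean returns.

    `span` is a hypothetical choice; alpha = 2 / (span + 1).
    """
    returns = np.asarray(returns, dtype=float)
    alpha = 2.0 / (span + 1.0)
    var = np.empty(len(returns))
    var[0] = returns[0] ** 2  # crude initialisation from the first observation
    for t in range(1, len(returns)):
        var[t] = alpha * returns[t] ** 2 + (1.0 - alpha) * var[t - 1]
    return np.sqrt(var)


def volatility_scale(returns, target_vol=0.10, span=60,
                     periods_per_year=252, vol_floor=1e-4):
    """Scale returns so that realised volatility is close to `target_vol`.

    All default parameter values are assumptions made for illustration.
    """
    returns = np.asarray(returns, dtype=float)
    annual_vol = ewm_volatility(returns, span) * np.sqrt(periods_per_year)
    leverage = np.ones_like(returns)
    # Size day t with the estimate available at t - 1 to avoid look-ahead.
    leverage[1:] = target_vol / np.maximum(annual_vol[:-1], vol_floor)
    return leverage * returns


# Usage on synthetic data (not the futures dataset described above).
rng = np.random.default_rng(0)
raw = rng.normal(0.0, 0.02, size=2000)  # roughly 32% annualised volatility
scaled = volatility_scale(raw)
print("raw annualised vol:   ", np.std(raw) * np.sqrt(252))
print("scaled annualised vol:", np.std(scaled[200:]) * np.sqrt(252))
```

Lagging the volatility estimate by one day keeps the scaling tradable, since the leverage for day t only uses information up to day t - 1. The first observations are dominated by the crude initialisation, so the usage example discards a warm-up period before measuring realised volatility.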