Large scale continuous-time mean-variance portfolio allocation via reinforcement learning

  • 2019-08-02 08:37:19
  • Haoran Wang
  • 0


We propose to solve large scale Markowitz mean-variance (MV) portfolioallocation problem using reinforcement learning (RL). By adopting the recentlydeveloped continuous-time exploratory control framework, we formulate theexploratory MV problem in high dimensions. We further show the optimality of amultivariate Gaussian feedback policy, with time-decaying variance, in tradingoff exploration and exploitation. Based on a provable policy improvementtheorem, we devise a scalable and data-efficient RL algorithm and conduct largescale empirical tests using data from the S&P 500 stocks. We found that ourmethod consistently achieves over 10% annualized returns and it outperformseconometric methods and the deep RL method by large margins, for both long andmedium terms of investment with monthly and daily trading.


Quick Read (beta)

Large scale continuous-time mean-variance portfolio allocation via reinforcement learning

Haoran Wang
Department of Industrial Engineering and Operations Research
Columbia University
New York, NY 10027

We propose to solve large scale Markowitz mean-variance (MV) portfolio allocation problem using reinforcement learning (RL). By adopting the recently developed continuous-time exploratory control framework, we formulate the exploratory MV problem in high dimensions. We further show the optimality of a multivariate Gaussian feedback policy, with time-decaying variance, in trading off exploration and exploitation. Based on a provable policy improvement theorem, we devise a scalable and data-efficient RL algorithm and conduct large scale empirical tests using data from the S&P 500 stocks. We found that our method consistently achieves over 10% annualized returns and it outperforms econometric methods and the deep RL method by large margins, for both long and medium terms of investment with monthly and daily trading.


Large scale continuous-time mean-variance portfolio allocation via reinforcement learning

  Haoran Wang Department of Industrial Engineering and Operations Research Columbia University New York, NY 10027


noticebox[b]Preprint. Work in progress.\[email protected]

1 Introduction

Reinforcement learning (RL) has demonstrated to be sucessful in games ([Silver et al., 2016], Silver et al. [2017], Mnih et al. [2015]) and robotics (Levine et al. [2016], Peters et al. [2003]), which also raised significant attention on its applications to quantitative finance. Notable examples include large scale optimal order execution using clssical Q-learning method (Nevmyvaka et al. [2006]), portfolio allocation using direct policy search (Moody and Saffell [2001], Moody et al. [1998]), and option pricing and hedging using deep RL methods (Buehler et al. [2019]), among others.

However, most existing works only focus on RL problems with expected utility of discounted rewards. Such criteria are either unable to fully characterize the uncertainty of the decision making process in financial markets or opaque to typical investors. On the other hand, mean–variance (MV) is one of the most important criteria for portfolio choice. Initiated in the Nobel Prize winning work Markowitz [1952] for portfolio selection in a single period, such a criterion yields an asset allocation strategy that minimizes the variance of the final payoff while targeting some prespecified mean return. The popularity of the MV criterion is not only due to its intuitive and transparent nature in capturing the tradeoff between risk and reward for practitioners, but also due to the theoretically interesting issue of time-inconsistency (or Bellman’s inconsistency) inherent with the underlying stochastic optimization and control problems.

In a recent paper Wang and Zhou [2019], the authors established an RL framework for studying the continuous-time MV portfolio selection, with continuous portfolio (action) and wealth (state) spaces. Their framework adopts a general entropy-regularized, relaxed stochastic control formulation, known as the exploratory formulation, which was originally developed in Wang et al. [2019] to capture explicitly the tradeoff between exploration and exploitation in RL for continuous-time optimization problems. The paper Wang and Zhou [2019] proved the optimality of Gaussian exploration (with time-decaying variance) for the MV problem in one dimension, and proposed a data-driven algorithm, the EMV algorithm, to learn the optimal Gaussian policy of the exploratory MV problem. Their simulation shows that the EMV algorithm outperforms both a classical econometric method and the deep deterministic policy gradient (DDPG) algorithm by large margins when solving the MV problem in the setting with only one risky asset.

It is the contribution of this work to generalize the continuous-time exploratory MV framework in Wang and Zhou [2019] to large scale portfolio selection setting, with the number of risky assets being relatively large and the available training data being relatively limited. We establish the theoretical optimality of the high-dimensional Gaussian policy and design a scalable EMV algorithm to directly output portfolio allocation strategies. By switching to portfolio selection in high dimensions, we can in principle take more advantage of the diversification effect (Markowitz [1959]) to have better performance while, however, potentially encountering the challenges of low sample efficiency and instability faced by most deep RL methods (Duan et al. [2016], Henderson et al. [2018]). Nevertheless, although the EMV algorithm is an on-policy approach, it can achieve better data efficiency than the off-policy method DDPG, thanks to a provable policy improvement theorem and the explicit functional structures of the theoretical optimal Gaussian policy and value function. For instance, in a 10 years monthly trading experiment (see section 5.2) where the available data point for training is the same amount as the decision making times for testing, the EMV algorithm still can outperform various alternative methods considered in this paper. To further empirically test the performance and robustness of the EMV algorithm, we conduct experiments using both monthly and daily price data of the S&P 500 stocks, for long and medium term investment horizons. Annual returns over 10% have been consistently observed across most experiments. The EMV algorithm also demonstrated remarkable universal applicability, as it can be trained and tested respectively on different sets of data from stocks that are randomly selected and still achieves competitive and, actually, more robust performance (see Appendix D).

2 Notations & Background

2.1 Classical continuous-time MV problem

We consider the classical MV problem in continuous time (without RL), where the investment universe consists of one riskless asset (savings account) and d risky assets (e.g., stocks). Let an investment planning horizon T>0 be fixed. Denote by {xtu,0tT} the discounted wealth (i.e. state) process of an agent who rebalances her portfolio (i.e. action) investing in the risky and riskless assets with a strategy (policy) u={ut,0tT}. Here ut=(ut1,,utd) is the discounted dollar value put in the d risky assets at time t. Under the geometric Brownian motion assumption for stocks prices and the standard self-financing condition, it follows (see Appendix A) that the wealth process satisfies

dxtu=σut(ρdt+dWt),0tT, (1)

with an initial endowment being x0u=x0. Here, Wt=(Wt1,,Wtd), 0tT, is a standard d-dimensional Brownian motion defined on a filtered probability space (Ω,,{t}0tT,). The vector11 1 All vectors in this paper are taken as column vectors. ρ is typically known as the market price of risk, and σd×d is the volatility matrix which is assumed to be non-degenerate.

The classical continuous-time MV model then aims to solve the following constrained optimization problem

minuVar[xTu],subject to𝔼[xTu]=z, (2)

where {xtu,0tT} satisfies the dynamics (1) under the investment strategy (portfolio) u, and z is an investment target set at t=0 as the desired target payoff at the end of the investment horizon [0,T].

Due to the variance in its objective, (2) is known to be time inconsistent. In this paper we focus ourselves to the so-called pre-committed strategies of the MV problem, which are optimal at t=0 only. To solve (2), one first transforms it into an unconstrained problem by applying a Lagrange multiplier w:22 2 Strictly speaking, 2w is the Lagrange multiplier.

minu𝔼[(xTu)2]-z2-2w(𝔼[xTu]-z)=minu𝔼[(xTu-w)2]-(w-z)2. (3)

This problem can be solved analytically, whose solution u*={ut*,0tT} depends on w. Then the original constraint 𝔼[xTu*]=z determines the value of w. We refer a detailed derivation to Zhou and Li [2000].

2.2 Exploratory continuous-time MV problem

The classical MV solution requires the estimation of the market parameters from historical time series of assets prices. However, as well known in practice, it is difficult to estimate the investment opportunity parameters, especially the mean return vector (aka the mean–blur problem; see, e.g., Luenberger [1998]) with a workable accuracy. Moreover, the classical optimal MV strategies are often extremely sensitive to these parameters, largely due to the procedure of inverting ill-conditioned covariance matrices to obtain optimal allocation weights. In view of these two issues, the Markowitz solution can be greatly irrelevant to the underlying investment objective.

On the other hand, RL techniques do not require, and indeed often skip, any estimation of model parameters. Rather, RL algorithms, driven by historical data, output optimal (or near-optimal) allocations directly. This is made possible by direct interactions with the unknown investment environment, in a learning (exploring) while optimizing (exploiting) fashion. Following Wang et al. [2019], we introduce the “exploratory" version of the state dynamics (1). In this formulation, the control (portfolio) process u={ut,0tT} is randomized to represent exploration in RL, leading to a measure-valued or distributional control process whose density function is given by π={πt,0tT}. The dynamics (1) is changed to

dXtπ = (dρσuπt(u)𝑑u)dt+(duσσuπt(u)𝑑u)12dBt, (4)

where {Bt,0tT} is a 1-dimensional standard Brownian motion on the filtered probability space (Ω,,{}0tT,). Mathematically, (4) coincides with the relaxed control formulation in classical control theory, and it is adopted here to characterize the effect of exploration on the underlying continuous-time state dynamics change. We refer the readers to Wang et al. [2019] for a detailed discussion on the motivation of (4).

The randomized, distributional control process π={πt,0tT} is to model exploration, whose overall level is in turn captured by its accumulative differential entropy

(π):=-0Tdπt(u)lnπt(u)𝑑u𝑑t. (5)

Further, we introduce a temperature parameter (or exploration weight) λ>0 reflecting the tradeoff between exploitation and exploration. The entropy-regularized, exploratory MV problem is then to solve, for any fixed w:

minπ𝒜(0,x0)𝔼[(XTπ-w)2+λ0Tdπt(u)lnπt(u)𝑑u𝑑t]-(w-z)2, (6)

where 𝒜(0,x0) is the set of admissible distributional controls on [0,T] whose precise definition is relegated to Appendix B. Once this problem is solved with a minimizer π*={πt*,0tT}, the Lagrange multiplier w can be determined by the additional constraint 𝔼[XTπ*]=z.

The optimization objective (6) explicitly encourages exploration, in contrast to the classical problem (3) which concerns exploitation only.

3 Optimality of Gaussian Exploration

To solve the exploratory MV problem (6), we apply the classical Bellman’s principle of optimality for the optimal value function V (see Appendix B for the precise definition of V):


for x and 0t<sT. Following standard arguments, we deduce that V satisfies the Hamilton-Jacobi-Bellman (HJB) equation

vt(t,x;w)+minπ𝒫(d)d(12uσσuvxx(t,x;w)+ρσuvx(t,x;w)+λlnπ(u))π(u)𝑑u=0, (7)

with the terminal condition v(T,x;w)=(x-w)2-(w-z)2. Here, 𝒫(d) denotes the set of density functions of probability measures on d that are absolutely continuous with respect to the Lebesgue measure and v denotes the generic unknown solution to the HJB equation.

Applying the usual verification technique and using the fact that π𝒫(d) if and only if dπ(u)𝑑u=1 and π(u)0, a.e., on d, we can solve the (constrained) optimization problem in the HJB equation (7) to obtain a feedback (distributional) control whose density function is given by

𝝅(u;t,x,w) = exp(-1λ(12uσσuvxx(t,x;w)+ρσuvx(t,x;w)))dexp(-1λ(12uσσuvxx(t,x;w)+ρσuvx(t,x;w)))𝑑u (8)
= 𝒩(u|-σ-1ρvx(t,x;w)vxx(t,x;w),(σσ)-1λvxx(t,x;w)),

where 𝒩(u|β,Σ) denotes the Gaussian density function with mean vector β and covariance matrix Σ. It is assumed in (8) that vxx(t,x;w)>0, which will be verified in what follows.

Substituting the candidate optimal Gaussian feedback control policy (8) back into the HJB equation (7), the latter is transformed to

vt(t,x;w)-ρρ2vx2(t,x;w)vxx(t,x,w)+λ2(d-dln(2πeλvxx(t,x;w))+ln(|σσ|))=0, (9)

with v(T,x;w)=(x-w)2-(w-z)2, where || denotes the matrix determinant. A direct computation yields that this equation has a classical solution

v(t,x;w)=(x-w)2e-ρρ(T-t)+λd4ρρ(T2-t2)-λd2(ρρT-1dln|σσ|πλ)(T-t)-(w-z)2, (10)

which clearly satisfies vxx(t,x;w)>0, for any (t,x)[0,T]×. It then follows that the candidate optimal feedback Gaussian policy (8) reduces to

𝝅(u;t,x,w)=𝒩(u|-σ-1ρ(x-w),(σσ)-1λ2eρρ(T-t)),(t,x)[0,T]×. (11)

Finally, the optimal wealth process (4) under 𝝅 becomes

dXt*=-ρρ(Xt*-w)dt+(ρρ(Xt*-w)2+λ2eρρ(T-t))12dBt,X0*=x0. (12)

It has a unique strong solution for 0tT, as can be easily verified. We now summarize the above results in the following theorem.

Theorem 1

The optimal value function of the entropy-regularized exploratory MV problem (6) is given by

V(t,x;w)=(x-w)2e-ρρ(T-t)+λd4ρρ(T2-t2)-λd2(ρρT-1dln|σσ|πλ)(T-t)-(w-z)2, (13)

for (t,x)[0,T]×R. Moreover, the optimal feedback control is Gaussian, with its density function given by

𝝅(u;t,x,w)=𝒩(u|-σ-1ρ(x-w),(σσ)-1λ2eρρ(T-t)). (14)

The associated optimal wealth process under 𝛑 is the unique solution of the stochastic differential equation

dXt*=-ρρ(Xt*-w)dt+(ρρ(Xt*-w)2+λ2eρρ(T-t))12dBt,X0*=x0. (15)

Finally, the Lagrange multiplier w is given by w=zeρρT-x0eρρT-1.

Proof. See Appendix C.1.   

Theorem 1 indicates that the level of exploration, measured by the variance of Gaussian policy λ2σ2eρ2(T-t), decays in time. The agent initially engages in exploration at the maximum level, and reduces it gradually (although never to zero) as time approaches the end of the investment horizon. Naturally, exploitation dominates exploration as time approaches maturity. Theorem 1 presents such a decaying exploration scheme endogenously which, to our best knowledge, has not been derived in the RL literature.

Moreover, the mean of the Gaussian distribution (14) is independent of the exploration weight λ, while its variance is independent of the state x. This highlights a perfect separation between exploitation and exploration, as the former is captured by the mean and the latter by the variance of the optimal Gaussian exploration. This property is also consistent with the linear–quadratic case in the infinite horizon studied in Wang et al. [2019].

It is reasonable to expect that the exploratory problem converges to its classical counterpart as the exploration weight λ decreases to 0. Let 𝒖 be the optimal feedback control for the classical MV problem, and denote by Vcl the optimal value function. Let δa() be the Dirac measure centered at ad. Then the following result holds.

Theorem 2

For each (t,x,w)[0,T]×R×R,

limλ0𝝅(;t,x;w)=δ𝒖(t,x;w)() weakly.



Proof. See Appendix C.2.   

4 RL Algorithm Design

4.1 A policy improvement theorem

We present a policy improvement theorem that is a crucial prerequisite for our interpretable RL algorithm, the EMV algorithm, which solves the exploratory MV problem in high dimensions.

Theorem 3 (Policy Improvement Theorem)

Let wR be fixed and 𝛑=𝛑(;,,w) be an arbitrarily given admissible feedback control policy. Suppose that the corresponding value function V𝛑(,;w)C1,2([0,T)×R)C0([0,T]×R) and satisfies Vxx𝛑(t,x;w)>0, for any (t,x)[0,T)×R. Suppose further that the feedback policy 𝛑~ defined by

𝝅~(u;t,x,w)=𝒩(u|-σ-1ρVx𝝅(t,x;w)Vxx𝝅(t,x;w),(σσ)-1λVxx𝝅(t,x;w)) (16)

is admissible. Then,

V𝝅~(t,x;w)V𝝅(t,x;w),(t,x)[0,T]×. (17)

Proof. See Appendix C.3.   

The above theorem suggests that there are always policies in the Gaussian family that improves the value function of any given, not necessarily Gaussian, policy. Moreover, the Gaussian family is closed under the policy improvement scheme. Hence, without loss of generality, we can simply focus on the Gaussian policies when choosing an initial solution. The next result shows convergence of both the value functions and the policies from a specifically parameterized Gaussian policy.

Theorem 4

Let 𝛑0(u;t,x,w)=N(u|α(x-w),Σeβ(T-t)), with αRd, βR and Σ being a d×d positive definite matrix. Denote by {𝛑n(u;t,x,w),(t,x)[0,T]×R,n1} the sequence of feedback policies updated by the policy improvement scheme (16), and {V𝛑n(t,x;w),(t,x)[0,T]×R,n1} the sequence of the corresponding value functions. Then,

limn𝝅n(;t,x,w)=𝝅*(;t,x,w) weakly, (18)


limnV𝝅n(t,x;w)=V(t,x;w), (19)

for any (t,x,w)[0,T]×R×R, where 𝛑* and V are the optimal Gaussian policy (14) and the optimal value function (13), respectively.

Proof. See Appendix C.4.   

4.2 The EMV algorithm

We provide the EMV algorithm to directly learn the optimal solution of the continuous-time exploratory MV problem in high dimensions within competitive training time. Theorem 3 provides guidance for policy improvement. For the policy evaluation step, we follow Doya [2000] to minimize the continuous-time Bellman’s error

δt:=V˙t𝝅+λdπt(u)lnπt(u)𝑑u, (20)

where V˙t𝝅=V𝝅(t+Δt,Xt+Δt)-V𝝅(t,Xt)Δt is the total derivative and Δt is the discretization step for the learning algorithm. This leads to the cost function to be minimized

C(θ,ϕ)=12(ti,xi)𝒟(V˙θ(ti,xi)+λdπtiϕ(u)lnπtiϕ(u)𝑑u)2Δt, (21)

using samples collected in the set 𝒟 under the current Gaussian policy 𝝅ϕ. Here, both the value function Vθ and the Gaussian policy πϕ can be parametrized more explicitly, in view of (13), Theorem 3 and 4. The cost function (21) can then be minimized by stochastic gradient decent. Finally, the EMV algorithm updates the Lagrange multiplier w every N iterations based on stochastic approximation and the constraint 𝔼[XTπ]=z, namely, ww-α(1NjxTj-z), where xTj’s are the most recent N terminal wealth values. We refer the readers to Wang and Zhou [2019] for a more detailed description of the EMV algorithm in the one risky asset scenario.

5 Empirical Results

5.1 Data and methods

We test the EMV algorithm on price data of the S&P 500 stocks for both monthly and daily trading. For the former, we train the EMV algorithm on the 10 years monthly data33 3 All data is from Wharton Research Data Services (WRDS). from 08-31-1990 to 08-31-2000, and then test the learned allocation strategy from 09-29-2000 to 09-30-2010. The initial wealth is normalized as 1 and the 10 years target is z=8, corresponding to a 23% annualized target return. In the daily rebalancing scenario, the EMV algorithm is trained on the 1 year daily data from 01-09-2017 to 01-08-2018 and tested on the subsequent year, with a 40% return set as the target for the 1 year investment horizon.

For comparison studies, we also train and test other alternative methods for solving the portfolio allocation problem on the same data. Specifically, we consider the classical econometric methods including Black-Litterman (BL, Black and Litterman [1992]), Fama-French (FF, Fama and French [1996]) and the Markowitz portfolio (Markowitz, Markowitz [1952]). A recently developed distributionally robust MV strategy, the Robust Wasserstein Profile Inference (RWPI, Blanchet et al. [2018]), is also included. To compare EMV with deep RL method, we adjust DDPG similarly as in Wang and Zhou [2019], so that it can solve the classical MV problem (3). All experiments were performed on a MacBook Air laptop, with DDPG trained using Tensorflow.

5.2 Test I: monthly rebalancing

We first consider d=20. By randomly selecting 20 stocks for each set/seed, we compose 100 different seeds. The split of training and testing data for EMV and DDPG is fixed as described above, but we consider two types of training. The first training method is batch (off-line) RL, where both algorithms are trained for multiple episodes using one seed, following by testing on the subsequent 10 years data of that seed. The performance is then averaged over the 100 seeds. Another method is to use all the 100 seeds and select one seed randomly for each episode during training. Then both algorithms are tested on randomly selected 100 seeds over the test period and the performance is averaged as well. The second method can be seen to artificially generate randomness for training and testing, and an algorithm that performs well using this method has universality and potential to generate to data of stocks in different sectors.

For competitive performance, we adopt a rolling-horizon based training and testing for all the other methods. Specifically, each time after the 1 month ahead investment decision is made on the test set, we add the most recent price data point from the test set into the training set, and discard the most obsolete data point from the training set.

(a) Monthly rebalancing (d=20)
(b) Daily rebalancing (d=50)
Figure 1: Investment performance comparison for (a) 10 years horizon with monthly rebalancing and (b) 1 year horizon with daily rebalancing.

Figure 0(a) shows the performance of various investment strategies, including variants of the EMV algorithm with different gross leverage constraints on portfolios.44 4 Leverage is a fundamental investment tool for most hedge funds; according to Ang et al. [2011], the average gross leverage across the 208 hedge funds studied therein is 213%. Under reasonable leverage constraint, the EMV algorithm still outperforms most other methods (which have no constraints, except DDPG) by a large margin, although it was trained only using the previous 10 years monthly data.

The universal training and testing method was used for EMV and DDPG in Figure 0(a). Results for the batch method can be found in Appendix D. A remarkable fact in both cases is that the original EMV algorithm, devised to solve the exploratory MV problem (6) without constraint, achieves the target z=8 with minimal variance for most of the test period. We also report various investment outcomes in Table 1 when scaling up d, the number of stocks in the portfolio.

5.3 Test II: daily rebalancing

For daily trading with d=50, we report the performance of the EMV algorithm under different gross leverage constraints in Figure 0(b). The DDPG algorithm was not competitive in the daily trading setting (see Table 1) and, hence, omitted. For different d, Table 1 summarizes the investment outcomes and the training time (per experiment). These results were obtained using the universal method for both training and testing.

\thesubsubtable Monthly rebalancing
d=20 d=60 d=100
EMV (L=200%) 10.8% 0.31 hrs 11.2% 4.34 hrs 6.3% 1.53 hrs 5
(0.797) (1.323 ) (1.627)
DDPG (L=200%) -300.1% 4.23 hrs 476.3% 5.32 hrs -653.4% 6.68 hrs
  (Unannualized) (-0.411) (0.359 ) (-0.432)
\thesubsubtable Daily rebalancing
d=50 d=75 d=100
EMV (L=200%) 44.9% 1.36 hrs 33.0% 5.54 hrs 17.9% 1.63 hrs55 5 Only 1000 episodes were trained, while the training episodes for other experiments were 20000.
(1.347) (1.370) (1.124)
DDPG (L=200%) -189.6% 6.20 hrs -27.9% 8.45 hrs -640.6% 14.42 hrs
(-0.096) (-0.012) (-0.219)
Table 1: Annualized return (with Sharpe ratio) and training time corresponding to different d for
(a) 10 years horizon with monthly rebalancing and (b) 1 year horizon with daily rebalancing.

6 Related Work

The difficulty of seeking the global optimum for Markov Decision Process (MDP) problems under the MV criterion has been previously noted in Mannor and Tsitsiklis [2013]. In fact, the variance of reward-to-go is nonlinear in expectation and, as a result of Bellman’s inconsistency, most of the well-known RL algorithms cannot be applied directly.

Existing works on variance estimation and control generally divide into value based methods and policy based methods. Sobel [1982] obtained the Bellman’s equation for the variance of reward-to-go under a fixed, given policy. Sato et al. [2001] further derived the TD(0) learning rule to estimate the variance, followed by Sato and Kobayashi [2000] which applied this value based method to an MV portfolio selection problem. It is worth noting that due to the definition of the value function (i.e., the variance penalized expected reward-to-go) in Sato and Kobayashi [2000], Bellman’s optimality principle does not hold. As a result, it is not guaranteed that a greedy policy based on the latest updated value function will eventually lead to the true global optimal policy. The second approach, the policy based RL, was proposed in Tamar et al. [2013]. They also extended the work to linear function approximators and devised actor-critic algorithms for MV optimization problems for which convergence to the local optimum is guaranteed with probability one (Tamar and Mannor [2013]). Related works following this line of research include Prashanth and Ghavamzadeh [2013], Prashanth and Ghavamzadeh [2016], among others. Despite the various methods mentioned above, it remains an open and interesting question in RL to search for the global optimum under the MV criterion.

In this paper, rather than relying on the typical framework of discrete-time MDP and discretizing time and state/action spaces accordingly, we designed the EMV algorithm to learn the global optimal solution of the continuous-time exploratory MV problem (6) directly. As pointed out in Doya [2000], it is typically challenging to find the right granularity to discretize the state and action spaces, and naive discretization may lead to poor performance. On the other hand, grid-based discretization methods for solving the HJB equation cannot easily extend to high dimensions in practice due to the curse of dimensionality, although theoretical convergence results have been established (see Munos and Bourgine [1998], Munos [2000]). Our EMV algorithm, however, is computationally feasible and implementable in high dimensions, as demonstrated by the experiements, due to the explicit representations of the value functions and the portfolio strategies, thereby devoid of the curse of dimensionality. Note that our algorithm does not use (deep) neural networks, which have been applied extensively in literature for (high-dimensional) continuous RL problems (e.g., Lillicrap et al. [2016], Mnih et al. [2015]) but known for unstable performance, sample inefficiency as well as extensive hyperparameter tuning (Mnih et al. [2015], Duan et al. [2016], Henderson et al. [2018]), in addition to their low interpretability.66 6 Interpretability is one of the most important and pressing issues in the general artificial intelligence applications in financial industry due to, among others, the regulatory requirement.

7 Conclusions

We studied continuous-time mean-variance (MV) portfolio allocation problem in high dimensions using RL methods. Under the exploratory control framework for general continuous-time optimization problems, we formulated the exploratory MV problem in high dimensions and proved the optimality of Gaussian policy in achieving the best tradeoff between exploration and exploitation. Our EMV algorithm, designed by combining quantitative finance analysis and RL techniques to solve the exploratory MV problem, is interpretable, scalable and data efficient, thanks to a provable policy improvement theorem and efficient functional approximations based on the theoretical optimal solutions. It consistently outperforms both classical model-based econometric methods and model-free deep RL method, across different training and testing scenarios. Interesting future research includes testing the EMV algorithm for shorter trading horizons with tick data (e.g. high frequency trading), or for trading other financial instruments such as mean-variance option hedging.


The author would like to thank Prof. Xun Yu Zhou for generous support and continuing encouragement on this work. The author also wants to thank Lin (Charles) Chen for providing the results on BL, FF, Markowitz and RWPI methods.


  • Ang et al. [2011] Andrew Ang, Sergiy Gorovyy, and Gregory B Van Inwegen. Hedge fund leverage. Journal of Financial Economics, 102(1):102–126, 2011.
  • Black and Litterman [1992] Fischer Black and Robert Litterman. Global portfolio optimization. Financial Analysts Journal, 48(5):28–43, 1992.
  • Blanchet et al. [2018] Jose Blanchet, Lin Chen, and Xun Yu Zhou. Distributionally robust mean-variance portfolio selection with Wasserstein distances. arXiv preprint arXiv:1802.04885, 2018.
  • Buehler et al. [2019] Hans Buehler, Lukas Gonon, Josef Teichmann, and Ben Wood. Deep hedging. Quantitative Finance, pages 1–21, 2019.
  • Doya [2000] Kenji Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000.
  • Duan et al. [2016] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
  • Fama and French [1996] Eugene F Fama and Kenneth R French. Multifactor explanations of asset pricing anomalies. The Journal of Finance, 51(1):55–84, 1996.
  • Henderson et al. [2018] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Levine et al. [2016] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • Lillicrap et al. [2016] Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
  • Luenberger [1998] David G Luenberger. Investment Science. Oxford University Press, New York, 1998.
  • Mannor and Tsitsiklis [2013] Shie Mannor and John N Tsitsiklis. Algorithmic aspects of mean–variance optimization in Markov decision processes. European Journal of Operational Research, 231(3):645–653, 2013.
  • Markowitz [1952] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.
  • Markowitz [1959] Harry Markowitz. Portfolio Selection: Efficient Diversification of Investments. Yale University Press, 1959. ISBN 9780300013726. URL
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, and Georg Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Moody and Saffell [2001] John Moody and Matthew Saffell. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875–889, 2001.
  • Moody et al. [1998] John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17(5-6):441–470, 1998.
  • Munos [2000] Rémi Munos. A study of reinforcement learning in the continuous case by the means of viscosity solutions. Machine Learning, 40(3):265–299, 2000.
  • Munos and Bourgine [1998] Rémi Munos and Paul Bourgine. Reinforcement learning for continuous stochastic control problems. In Advances in Neural Information Processing Systems, pages 1029–1035, 1998.
  • Nevmyvaka et al. [2006] Yuriy Nevmyvaka, Yi Feng, and Michael Kearns. Reinforcement learning for optimized trade execution. In Proceedings of the 23rd International Conference on Machine Learning, pages 673–680, 2006.
  • Peters et al. [2003] Jan Peters, Sethu Vijayakumar, and Stefan Schaal. Reinforcement learning for humanoid robotics. In Proceedings of the 3rd IEEE-RAS International Conference on Humanoid Robots, pages 1–20, 2003.
  • Prashanth and Ghavamzadeh [2013] LA Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances In Neural Information Processing Systems, pages 252–260, 2013.
  • Prashanth and Ghavamzadeh [2016] LA Prashanth and Mohammad Ghavamzadeh. Variance-constrained actor-critic algorithms for discounted and average reward MDPs. Machine Learning, 105(3):367–417, 2016.
  • Sato and Kobayashi [2000] Makoto Sato and Shigenobu Kobayashi. Variance-penalized reinforcement learning for risk-averse asset allocation. In International Conference on Intelligent Data Engineering and Automated Learning, pages 244–249. Springer, 2000.
  • Sato et al. [2001] Makoto Sato, Hajime Kimura, and Shibenobu Kobayashi. TD algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence, 16(3):353–362, 2001.
  • Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  • Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, and Adrian Bolton. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
  • Sobel [1982] Matthew J Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802, 1982.
  • Tamar and Mannor [2013] Aviv Tamar and Shie Mannor. Variance adjusted actor critic algorithms. arXiv preprint arXiv:1310.3697, 2013.
  • Tamar et al. [2013] Aviv Tamar, Dotan Di Castro, and Shie Mannor. Temporal difference methods for the variance of the reward to go. In International Conference on Machine Learning, pages 495–503, 2013.
  • Wang and Zhou [2019] Haoran Wang and Xun Yu Zhou. Continuous-time mean-variance portfolio selection: A reinforcement learning framework. arXiv preprint arXiv:1904.11392, 2019.
  • Wang et al. [2019] Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou. Exploration versus exploitation in reinforcement learning: A stochastic control approach. arXiv preprint: arXiv:1812.01552v3, 2019.
  • Zhou and Li [2000] Xun Yu Zhou and Duan Li. Continuous-time mean-variance portfolio selection: A stochastic LQ framework. Applied Mathematics and Optimization, 42(1):19–33, 2000.

Appendix A Controlled Wealth Dynamics

Let Wt=(Wt1,,Wtd), 0tT be a standard d-dimensional Brownian motion defined on a filtered probability space (Ω,,{t}0tT,) that satisfies the usual conditions. The price process of the i-th risky asset is a geometric Brownian motion governed by

dSti=Sti(μidt+σidWt),0tT,i=1,,d, (22)

with S0i=s0i>0 being the initial price at t=0, and μi, σi=(σ1i,,σdi)d being the mean return and volatility coefficients of the i-th risky asset, respectively. We denote for brevity the mean return vector by μd, and the volatility matrix by σd×d, whose i-th column represents the volatility σi of the i-th risky asset. The riskless asset has a constant interest rate r>0. We assume that σ is non-degenerate and hence there exists a d-dimensional vector ρ that satisfies σρ=μ-r𝟏, where 𝟏 is the d-dimensional vector with all components being 1. The vector ρ is known as the market price of risk. It is worth noting that the above assumptions are only made for the convenience of deriving theoretical results in the paper; in practice, all the model parameters are unknown and time-varying, and it is the goal of RL algorithms to directly output trading strategies without relying on estimation of any underlying parameters.

Denote by ut0 and ut=(ut1,,utd) the discounted dollar value put in the savings account and the d risky assets, respectively, at time t. It then follows that the discounted wealth process is xtu=i=0duti, 0tT. The self-financing condition further implies that, using (22), we have


Appendix B Value Functions and Admissible Control Distributions

In order to rigorously solve (6) by dynamic programming, we need to define the value functions. For each (s,y)[0,T)×, consider the state equation (4) on [s,T] with Xsπ=y. Define the set of admissible controls, 𝒜(s,y), as follows. Let (d) be the Borel algebra on d. A (distributional) control (or portfolio/strategy) process π={πt,stT} belongs to 𝒜(s,y), if

(i)  for each stT, πt𝒫(d) a.s.;

(ii) for each A(d), {Aπt(u)𝑑u,stT} is t-progressively measurable;

(iii) 𝔼[sTd|σu|2πt(u)𝑑u𝑑t]<;

(iv) 𝔼[|(XTπ-w)2+λsTdπt(u)lnπt(u)dudt||Xsπ=y]<.

Clearly, it follows from condition (iii) that the stochastic differential equation (SDE) (4) has a unique strong solution for stT that satisfies Xsπ=y.

Controls in 𝒜(s,y) are measure-valued (or, precisely, density-function-valued) stochastic processes, which are also called open-loop controls in the control terminology. As in the classical control theory, it is important to distinguish between open-loop controls and feedback (or closed-loop) controls (or policies as in the RL literature, or laws as in the control literature). Specifically, a deterministic mapping 𝝅(;,) is called an (admissible) feedback control if i) 𝝅(;t,x) is a density function for each (t,x)[0,T]×; ii) for each (s,y)[0,T)×, the following SDE (which is the system dynamics after the feedback policy 𝝅(;,) is applied)

dXt𝝅=(dρσu𝝅(u;t,Xt𝝅))du)dt+(duσσu𝝅(u;t,Xt𝝅))du)12dBt,X𝝅s=y, (23)

has a unique strong solution {Xt𝝅,t[s,T]}, and the open-loop control π={πt, t[s,T]}𝒜(s,y) where πt:=𝝅(;t,Xt𝝅). In this case, the open-loop control π is said to be generated from the feedback policy 𝝅(;,) with respect to the initial time and state, (s,y). It is useful to note that an open-loop control and its admissibility depend on the initial (s,y), whereas a feedback policy can generate open-loop controls for any (s,y)[0,T)×, and hence is in itself independent of (s,y). Note that throughout this paper, we have used boldfaced 𝝅 to denote feedback controls, and the normal style π to denote open-loop controls.

Now, for a fixed w, define

V(s,y;w):=infπ𝒜(s,y)𝔼[(XTπ-w)2+λ0Tdπt(u)lnπt(u)dudt|Xsπ=y]-(w-z)2, (24)

for (s,y)[0,T)×. The function V(,;w) is called the optimal value function of the problem.

Moreover, we define the value function under any given feedback control 𝝅:

V𝝅(s,y;w)=𝔼[(XT𝝅-w)2+λsTdπt(u)lnπt(u)dudt|Xs𝝅=y]-(w-z)2, (25)

for (s,y)[0,T)×, where π={πt, t[s,T]} is the open-loop control generated from 𝝅 with respect to (s,y) and {Xt𝝅,t[s,T]} is the corresponding wealth process.

Note that in the control literature, V given by (24) is called the value function. However, in the RL literature the term “value function" is also used for the objective value under a particular control (i.e. V𝝅 in (25)). So to avoid ambiguity we have called V the optimal value function in this paper.

Appendix C Proofs

C.1 Proof of Theorem 1

The main proof of Theroem 1 would be the verification arguments that aim to show the optimal value function of problem (6) is given by (13) and that the candidate optimal policy (14) is indeed admissible, based on the definitions in Appendix B. Since the current exploratory MV problem is a special case of the exploratory linear-quadratic problem extensively studied in Wang et al. [2019], a detailed proof would follow the same lines of that of Theorem 4 therein, and is left for interested readers.

Proof. We now determine the Lagrange multiplier w through the constraint 𝔼[XT*]=z. It follows from (15), along with the standard estimate that 𝔼[maxt[0,T](Xt*)2]< and Fubini’s Theorem, that


Hence, 𝔼[Xt*]=(x0-w)e-ρρt+w. The constraint 𝔼[XT*]=z now becomes (x0-w)e-ρρT+w=z, which gives w=zeρρT-x0eρρT-1.   

C.2 Proof of Theorem 2

To prove the solution of the exploratory MV problem converges to that of the classical MV problem, as λ0, we first recall the solution of the classical MV problem.

In order to apply dynamic programming for (3), we again consider the set of admissible controls, 𝒜cl(s,y), for (s,y)[0,T)×,

𝒜cl(s,y):={u={ut,t[s,T]}: u is t-progressively measurable and 𝔼[sT|σut|2dt]<}.

The (optimal) value function is defined by

Vcl(s,y;w):=infu𝒜cl(s,y)𝔼[(xTu-w)2|xsu=y]-(w-z)2, (26)

for (s,y)[0,T)×, where w is fixed. Once this problem is solved, w can be determined by the constraint 𝔼[xT*]=z, with {xt*,t[0,T]} being the optimal wealth process under the optimal portfolio u*.

The HJB equation is

ωt(t,x;w)+minud(12uσσuωxx(t,x;w)+ρσuωx(t,x;w))=0,(t,x)[0,T)×, (27)

with the terminal condition ω(T,x;w)=(x-w)2-(w-z)2.

Standard verification arguments deduce the optimal value function to be

Vcl(t,x;w)=(x-w)2e-ρρ(T-t)-(w-z)2, (28)

the optimal feedback control policy to be

𝒖(t,x;w)=-σ-1ρ(x-w), (29)

and the corresponding optimal wealth process to be the unique strong solution to the SDE

dxt*=-ρρ(xt*-w)dt-ρ(xt*-w)dWt,x0*=x0. (30)

Comparing the optimal wealth dynamics, (15) and (30), of the exploratory and classical problems, we note that they have the same drift coefficient (but different diffusion coefficients). As a result, the two problems have the same mean of optimal terminal wealth and hence the same value of the Lagrange multiplier w=zeρρT-x0eρρT-1 determined by the constraint 𝔼[xT*]=z.

Proof. The weak convergence of the feedback controls follows from the explicit forms of 𝝅 in (14) and 𝒖 in (29). The pointwise convergence of the value functions follows easily from the forms of V in (13) and Vcl in (28), together with the fact that



C.3 Proof of Theorem 3

Proof. Fix (t,x)[0,T]×. Since, by assumption, the feedback policy 𝝅~ is admissible, the open-loop control strategy, π~={π~v,v[t,T]}, generated from 𝝅~ with respect to the initial condition Xt𝝅~=x is admissible. Let {Xs𝝅~,s[t,T]} be the corresponding wealth process under π~. Applying Itô’s formula, we have

+ρσuVx𝝅(v,Xv𝝅~))π~v(u)dudv+ts(duσσuπ~v(u)du)12V𝝅(v,Xv𝝅~)dBv,s[t,T]. (31)

Define the stopping times τn:=inf{st:tsduσσuπ~v(u)𝑑u(V𝝅(v,Xv𝝅~))2𝑑vn}, for n1. Then, from (31), we obtain

-tsτnd(12uσσuVxx𝝅(v,Xv𝝅~)+ρσuVx𝝅(v,Xv𝝅~))π~v(u)dudv|Xt𝝅~=x]. (32)

On the other hand, by standard arguments and the assumption that V𝝅 is smooth, we have


for any (t,x)[0,T)×. It follows that

Vt𝝅(t,x)+minπ^𝒫(d)d(12uσσuVxx𝝅(t,x)+ρσuVx𝝅(t,x)+λlnπ^(u))π^(u)𝑑u0. (33)

Notice that the minimizer of the Hamiltonian in (33) is given by the feedback policy 𝝅~ in (16). It then follows that equation (32) implies


for (t,x)[0,T]× and s[t,T]. Now taking s=T, and using that V𝝅(T,x)=V𝝅~(T,x)=(x-w)2-(w-z)2 together with the assumption that π~ is admissible, we obtain, by sending n and applying the dominated convergence theorem, that


for any (t,x)[0,T]×.   

C.4 Proof of Theorem 4

Proof. It can be easily verified that the feedback policy 𝝅0(u;t,x,w)=𝒩(u|α(x-w),Σeβ(T-t)) generates an open-loop policy π0 that is admissible with respect to the initial (t,x). Moreover, it follows from the Feynman-Kac formula that the corresponding value function V𝝅0 satisfies the PDE

+λlnπ0(u;t,x,w))π0(u;t,x,w)du=0, (34)

with terminal condition V𝝅0(T,x;w)=(x-w)2-(w-z)2. Simplifying this equation we obtain

+Vx𝝅0(t,x;w)ρσα(x-w)-λ2(dln(2πe)+ln|Σ|+dβ(T-t))=0, (35)

where Tr() denotes the trace of a square matrix. A classical solution to equation (35) is given by


if β+2ρσα+Tr(σαασ)0 and, by


if β+2ρσα+Tr(σαασ)=0. In either case, it is easy to check that V𝝅0 satisfies the conditions in Theorem 3 and, hence, the theorem applies. The improved policy is given by (16), which, in the current case, becomes


Again, we can calculate the corresponding value function as V𝝅1(t,x;w)=(x-w)2e-ρρ(T-t)+F1(t), where F1 is a function of t only. Theorem 3 is applicable again, which yields the improved policy 𝝅2 as exactly the optimal Gaussian policy 𝝅* given in (14), together with the optimal value function V in (13). The desired convergence therefore follows, as for n2, both the policy and the value function will no longer strictly improve under the policy improvement scheme (16).   

Appendix D Empirical Results: the Batch Method

In Section 5.2, we provided the experiment results for monthly trading under universal training and testing. Another way to train and test the EMV and DDPG algorithms is based on batch (off-line) RL, as described in the main text. The batch method applies to the training and testing data from the same set/seed of d=20 stocks for each experiment, and the investment performance of the EMV algorithm over the 100 seeds are reported in Figure 1(a). Due to the extensive training time (see Table 1), we only train and test DDPG under the batch method for 8 seeds.

The batch method demonstrates qualitatively similar behavior as the universal training and testing method (see Figure 0(a)), when compared to the econometric methods and the deep RL method. A more detailed comparison between the two methods is shown in Figure 1(b). It is interesting to notice that, while the two methods perform equally well for most of the testing period over 2000-2010, the universal method is less affected by the 2008 financial crisis with less variability and higher returns. The batch method, without taking into account the data of other stocks for each portfolio/seed during training and testing, is more susceptible to stock market plunge. Nonetheless, both methods are data efficient, especially in view that, for example, the training set for the batch method contains the same number of data points as the testing set decision making points (120×20).

(a) Batch method
(b) Batch and universal methods
Figure 2: Investment performance comparison for (a) batch RL training and testing and (b) batch method and universal method (d=20).