Large scale continuous-time mean-variance portfolio allocation via reinforcement learning
Abstract
We propose to solve the large scale Markowitz mean-variance (MV) portfolio allocation problem using reinforcement learning (RL). By adopting the recently developed continuous-time exploratory control framework, we formulate the exploratory MV problem in high dimensions. We further show the optimality of a multivariate Gaussian feedback policy, with time-decaying variance, in trading off exploration and exploitation. Based on a provable policy improvement theorem, we devise a scalable and data-efficient RL algorithm and conduct large scale empirical tests using data from the S&P $500$ stocks. We find that our method consistently achieves over $10\%$ annualized returns and outperforms econometric methods and the deep RL method by large margins, for both long and medium terms of investment with monthly and daily trading.
Haoran Wang Department of Industrial Engineering and Operations Research Columbia University New York, NY 10027
Preprint. Work in progress.
1 Introduction
Reinforcement learning (RL) has proven successful in games (Silver et al. [2016], Silver et al. [2017], Mnih et al. [2015]) and robotics (Levine et al. [2016], Peters et al. [2003]), which has also drawn significant attention to its applications in quantitative finance. Notable examples include large scale optimal order execution using the classical Q-learning method (Nevmyvaka et al. [2006]), portfolio allocation using direct policy search (Moody and Saffell [2001], Moody et al. [1998]), and option pricing and hedging using deep RL methods (Buehler et al. [2019]), among others.
However, most existing works focus only on RL problems with expected utility of discounted rewards. Such criteria are either unable to fully characterize the uncertainty of the decision making process in financial markets or opaque to typical investors. On the other hand, mean–variance (MV) is one of the most important criteria for portfolio choice. Initiated in the Nobel Prize-winning work of Markowitz [1952] for portfolio selection in a single period, such a criterion yields an asset allocation strategy that minimizes the variance of the final payoff while targeting some pre-specified mean return. The popularity of the MV criterion is due not only to its intuitive and transparent nature in capturing the trade-off between risk and reward for practitioners, but also to the theoretically interesting issue of time-inconsistency (or Bellman’s inconsistency) inherent in the underlying stochastic optimization and control problems.
In a recent paper, Wang and Zhou [2019] established an RL framework for studying continuous-time MV portfolio selection, with continuous portfolio (action) and wealth (state) spaces. Their framework adopts a general entropy-regularized, relaxed stochastic control formulation, known as the exploratory formulation, which was originally developed in Wang et al. [2019] to capture explicitly the trade-off between exploration and exploitation in RL for continuous-time optimization problems. Wang and Zhou [2019] proved the optimality of Gaussian exploration (with time-decaying variance) for the MV problem in one dimension, and proposed a data-driven algorithm, the EMV algorithm, to learn the optimal Gaussian policy of the exploratory MV problem. Their simulation shows that the EMV algorithm outperforms both a classical econometric method and the deep deterministic policy gradient (DDPG) algorithm by large margins when solving the MV problem in the setting with only one risky asset.
The contribution of this work is to generalize the continuous-time exploratory MV framework of Wang and Zhou [2019] to the large scale portfolio selection setting, in which the number of risky assets is relatively large and the available training data are relatively limited. We establish the theoretical optimality of the high-dimensional Gaussian policy and design a scalable EMV algorithm that directly outputs portfolio allocation strategies. By switching to portfolio selection in high dimensions, we can in principle take more advantage of the diversification effect (Markowitz [1959]) to achieve better performance, while potentially encountering the challenges of low sample efficiency and instability faced by most deep RL methods (Duan et al. [2016], Henderson et al. [2018]). Nevertheless, although the EMV algorithm is an on-policy approach, it can achieve better data efficiency than the off-policy method DDPG, thanks to a provable policy improvement theorem and the explicit functional structures of the theoretical optimal Gaussian policy and value function. For instance, in a $10$-year monthly trading experiment (see Section 5.2), where the number of data points available for training equals the number of decision times in testing, the EMV algorithm still outperforms the various alternative methods considered in this paper. To further test the performance and robustness of the EMV algorithm empirically, we conduct experiments using both monthly and daily price data of the S&P $500$ stocks, for long and medium term investment horizons. Annualized returns over $10\%$ have been consistently observed across most experiments. The EMV algorithm also demonstrates remarkable universal applicability: it can be trained and tested on different sets of data from randomly selected stocks and still achieves competitive and, in fact, more robust performance (see Appendix D).
2 Notations & Background
2.1 Classical continuous-time MV problem
We consider the classical MV problem in continuous time (without RL), where the investment universe consists of one riskless asset (savings account) and $d$ risky assets (e.g., stocks). Let an investment planning horizon $T>0$ be fixed. Denote by $\{{x}_{t}^{u},0\le t\le T\}$ the discounted wealth (i.e. state) process of an agent who rebalances her portfolio (i.e. action) investing in the risky and riskless assets with a strategy (policy) $u=\{{u}_{t},0\le t\le T\}$. Here ${u}_{t}=({u}_{t}^{1},\mathrm{\dots},{u}_{t}^{d})$ is the discounted dollar value put in the $d$ risky assets at time $t$. Under the geometric Brownian motion assumption for stock prices and the standard self-financing condition, it follows (see Appendix A) that the wealth process satisfies
$$d{x}_{t}^{u}=\sigma {u}_{t}\cdot (\rho dt+d{W}_{t}),0\le t\le T,$$  (1) 
with an initial endowment ${x}_{0}^{u}={x}_{0}\in \mathbb{R}$. Here, ${W}_{t}=({W}_{t}^{1},\mathrm{\dots},{W}_{t}^{d})$, $0\le t\le T$, is a standard $d$-dimensional Brownian motion defined on a filtered probability space $(\mathrm{\Omega},\mathcal{F},{\{{\mathcal{F}}_{t}\}}_{0\le t\le T},\mathbb{P})$. The vector $\rho $ is typically known as the market price of risk (all vectors in this paper are taken as column vectors), and $\sigma \in {\mathbb{R}}^{d\times d}$ is the volatility matrix, which is assumed to be nondegenerate.
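The dynamics (1) can be simulated with a simple Euler scheme. The following is a minimal sketch, assuming a constant dollar allocation $u$ and parameter values chosen purely for illustration; the function names `wealth_increment` and `simulate_terminal_wealth` are hypothetical and do not come from the paper.

```python
import math
import random


def wealth_increment(sigma, u, rho, dt, dW):
    """One Euler step of (1): dx = (sigma u) . (rho dt + dW)."""
    d = len(u)
    # sigma is a d x d list of rows; (sigma u)[i] = sum_j sigma[i][j] * u[j]
    sigma_u = [sum(sigma[i][j] * u[j] for j in range(d)) for i in range(d)]
    return sum(sigma_u[i] * (rho[i] * dt + dW[i]) for i in range(d))


def simulate_terminal_wealth(x0, sigma, u, rho, T, n_steps, rng):
    """Simulate x_T under a constant allocation u (an illustrative choice)."""
    dt = T / n_steps
    x = x0
    for _ in range(n_steps):
        # Independent Brownian increments, each with variance dt
        dW = [rng.gauss(0.0, math.sqrt(dt)) for _ in u]
        x += wealth_increment(sigma, u, rho, dt, dW)
    return x


# Illustrative (assumed) parameters for d = 2 risky assets
sigma = [[0.2, 0.0], [0.1, 0.3]]
rho = [0.5, 0.4]
x_T = simulate_terminal_wealth(1.0, sigma, [1.0, 2.0], rho, 1.0, 250, random.Random(0))
```

With zero Brownian increments the step reduces to the drift $(\sigma u)\cdot \rho \,dt$, which is a convenient sanity check of the implementation.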
The classical continuous-time MV model then aims to solve the following constrained optimization problem
$$\underset{u}{\mathrm{min}}\ \text{Var}[{x}_{T}^{u}],\quad \text{subject to}\quad \mathbb{E}[{x}_{T}^{u}]=z,$$  (2) 
where $\{{x}_{t}^{u},0\le t\le T\}$ satisfies the dynamics (1) under the investment strategy (portfolio) $u$, and $z\in \mathbb{R}$ is an investment target set at $t=0$ as the desired target payoff at the end of the investment horizon $[0,T]$.
Due to the variance in its objective, (2) is known to be time-inconsistent. In this paper we restrict ourselves to the so-called pre-committed strategies of the MV problem, which are optimal at $t=0$ only. To solve (2), one first transforms it into an unconstrained problem by applying a Lagrange multiplier $w$ (strictly speaking, $2w\in \mathbb{R}$ is the Lagrange multiplier):
$$\underset{u}{\mathrm{min}}\ \mathbb{E}[{({x}_{T}^{u})}^{2}]-{z}^{2}-2w\left(\mathbb{E}[{x}_{T}^{u}]-z\right)=\underset{u}{\mathrm{min}}\ \mathbb{E}[{({x}_{T}^{u}-w)}^{2}]-{(w-z)}^{2}.$$  (3) 
This problem can be solved analytically, and its solution ${u}^{*}=\{{u}_{t}^{*},0\le t\le T\}$ depends on $w$. The original constraint $\mathbb{E}[{x}_{T}^{{u}^{*}}]=z$ then determines the value of $w$. For a detailed derivation, see Zhou and Li [2000].
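As a short check of the equality in (3), using only the definitions above, expanding the square on the right-hand side gives
$$\mathbb{E}[{({x}_{T}^{u}-w)}^{2}]-{(w-z)}^{2}=\mathbb{E}[{({x}_{T}^{u})}^{2}]-2w\mathbb{E}[{x}_{T}^{u}]+{w}^{2}-{w}^{2}+2wz-{z}^{2}=\mathbb{E}[{({x}_{T}^{u})}^{2}]-{z}^{2}-2w\left(\mathbb{E}[{x}_{T}^{u}]-z\right).$$
For any strategy satisfying the constraint $\mathbb{E}[{x}_{T}^{u}]=z$, the multiplier term vanishes and the objective reduces to $\mathbb{E}[{({x}_{T}^{u})}^{2}]-{z}^{2}=\text{Var}[{x}_{T}^{u}]$, which is the objective of (2).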
2.2 Exploratory continuous-time MV problem
The classical MV solution requires the estimation of the market parameters from historical time series of asset prices. However, as is well known in practice, it is difficult to estimate the investment opportunity parameters, especially the mean return vector (aka the mean–blur problem; see, e.g., Luenberger [1998]), with workable accuracy. Moreover, the classical optimal MV strategies are often extremely sensitive to these parameters, largely because of the procedure of inverting ill-conditioned covariance matrices to obtain optimal allocation weights. In view of these two issues, the Markowitz solution can be largely irrelevant to the underlying investment objective.
On the other hand, RL techniques do not require, and indeed often skip, any estimation of model parameters. Rather, RL algorithms, driven by historical data, output optimal (or near-optimal) allocations directly. This is made possible by direct interactions with the unknown investment environment, in a learning (exploring) while optimizing (exploiting) fashion. Following Wang et al. [2019], we introduce the “exploratory” version of the state dynamics (1). In this formulation, the control (portfolio) process $u=\{{u}_{t},0\le t\le T\}$ is randomized to represent exploration in RL, leading to a measure-valued or distributional control process whose density function is given by $\pi =\{{\pi}_{t},0\le t\le T\}$. The dynamics (1) is changed to
$$d{X}_{t}^{\pi}=\left({\int}_{{\mathbb{R}}^{d}}{\rho}^{\prime}\sigma u{\pi}_{t}(u)\,du\right)dt+{\left({\int}_{{\mathbb{R}}^{d}}{u}^{\prime}{\sigma}^{\prime}\sigma u{\pi}_{t}(u)\,du\right)}^{\frac{1}{2}}d{B}_{t},$$  (4) 
where $\{{B}_{t},0\le t\le T\}$ is a one-dimensional standard Brownian motion on the filtered probability space $(\mathrm{\Omega},\mathcal{F},{\{{\mathcal{F}}_{t}\}}_{0\le t\le T},\mathbb{P})$. Mathematically, (4) coincides with the relaxed control formulation in classical control theory, and it is adopted here to characterize the effect of exploration on the underlying continuous-time state dynamics. We refer the readers to Wang et al. [2019] for a detailed discussion of the motivation of (4).
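When ${\pi}_{t}$ is a Gaussian density with mean $\mu $ and covariance $\mathrm{\Sigma}$, the two integrals in (4) have closed forms: the drift is ${\rho}^{\prime}\sigma \mu $ and the squared diffusion is ${\mu}^{\prime}{\sigma}^{\prime}\sigma \mu +\text{tr}({\sigma}^{\prime}\sigma \mathrm{\Sigma})$, by the standard second-moment identity for Gaussian vectors. The sketch below evaluates them; the function name `exploratory_coefficients` and the numerical inputs are illustrative assumptions, not part of the paper.

```python
import math


def exploratory_coefficients(rho, sigma, mean, cov):
    """Drift and diffusion of (4) when pi_t is Gaussian N(mean, cov).

    drift     = rho' sigma mean
    diffusion = sqrt(mean' sigma' sigma mean + trace(sigma' sigma cov))
    """
    d = len(mean)
    sigma_mean = [sum(sigma[i][j] * mean[j] for j in range(d)) for i in range(d)]
    drift = sum(rho[i] * sigma_mean[i] for i in range(d))
    # Gram matrix sigma' sigma
    gram = [[sum(sigma[k][i] * sigma[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]
    second_moment = sum(s * s for s in sigma_mean)  # mean' sigma' sigma mean
    # trace(gram @ cov) = sum_{i,j} gram[i][j] * cov[j][i]
    second_moment += sum(gram[i][j] * cov[j][i] for i in range(d) for j in range(d))
    return drift, math.sqrt(second_moment)


# Illustrative (assumed) inputs with d = 2
drift, diffusion = exploratory_coefficients(
    rho=[0.5, 0.25],
    sigma=[[1.0, 0.0], [0.0, 2.0]],
    mean=[1.0, 1.0],
    cov=[[1.0, 0.0], [0.0, 1.0]],
)
```

The trace term shows how exploration (a nonzero $\mathrm{\Sigma}$) adds volatility to the wealth process without changing its drift.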
The randomized, distributional control process $\pi =\{{\pi}_{t},0\le t\le T\}$ models exploration, and its overall level is in turn captured by the accumulative differential entropy
$$\mathscr{H}(\pi ):=-{\int}_{0}^{T}{\int}_{{\mathbb{R}}^{d}}{\pi}_{t}(u)\mathrm{ln}{\pi}_{t}(u)\,du\,dt.$$  (5) 
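For a Gaussian policy the inner integral in (5) is available in closed form: the differential entropy $-\int \pi \mathrm{ln}\pi \,du$ of a $d$-dimensional Gaussian with covariance $\mathrm{\Sigma}$ is $\frac{d}{2}\mathrm{ln}(2\pi e)+\frac{1}{2}\mathrm{ln}|\mathrm{\Sigma}|$. The sketch below uses this standard formula and approximates the time integral by a left Riemann sum; the function names and the discretization are illustrative assumptions.

```python
import math


def gaussian_entropy(d, log_det_cov):
    """Differential entropy of a d-dimensional Gaussian, given ln|Sigma|:
    (d/2) ln(2 pi e) + (1/2) ln|Sigma|."""
    return 0.5 * d * math.log(2.0 * math.pi * math.e) + 0.5 * log_det_cov


def accumulated_entropy(d, log_det_cov_at, T, n_steps):
    """Left Riemann-sum approximation of (5) for a Gaussian policy whose
    covariance at time t has log-determinant log_det_cov_at(t)."""
    dt = T / n_steps
    return sum(gaussian_entropy(d, log_det_cov_at(k * dt)) * dt
               for k in range(n_steps))


# Unit-variance one-dimensional Gaussian held over a horizon T = 2 (assumed values)
total = accumulated_entropy(1, lambda t: 0.0, 2.0, 10)
```

A larger covariance gives a larger entropy, so (5) grows with the amount of randomization in the policy.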
Further, we introduce a temperature parameter (or exploration weight) $\lambda >0$ reflecting the trade-off between exploitation and exploration. The entropy-regularized, exploratory MV problem is then to solve, for any fixed $w\in \mathbb{R}$:
$$\underset{\pi \in \mathcal{A}(0,{x}_{0})}{\mathrm{min}}\ \mathbb{E}\left[{({X}_{T}^{\pi}-w)}^{2}+\lambda {\int}_{0}^{T}{\int}_{{\mathbb{R}}^{d}}{\pi}_{t}(u)\mathrm{ln}{\pi}_{t}(u)\,du\,dt\right]-{(w-z)}^{2},$$  (6) 
where $\mathcal{A}(0,{x}_{0})$ is the set of admissible distributional controls on $[0,T]$ whose precise definition is relegated to Appendix B. Once this problem is solved with a minimizer ${\pi}^{*}=\{{\pi}_{t}^{*},0\le t\le T\}$, the Lagrange multiplier $w$ can be determined by the additional constraint $\mathbb{E}[{X}_{T}^{{\pi}^{*}}]=z$.
3 Optimality of Gaussian Exploration
To solve the exploratory MV problem (6), we apply the classical Bellman’s principle of optimality for the optimal value function $V$ (see Appendix B for the precise definition of $V$):
$$V(t,x;w)=\underset{\pi \in \mathcal{A}(t,x)}{\mathrm{inf}}\ \mathbb{E}\left[V(s,{X}_{s}^{\pi};w)+\lambda {\int}_{t}^{s}{\int}_{{\mathbb{R}}^{d}}{\pi}_{l}(u)\mathrm{ln}{\pi}_{l}(u)\,du\,dl\ \Big|\ {X}_{t}^{\pi}=x\right],$$ 
for $x\in \mathbb{R}$ and $0\le t<s\le T$. Following standard arguments, we deduce that $V$ satisfies the Hamilton-Jacobi-Bellman (HJB) equation
$${v}_{t}(t,x;w)+\underset{\pi \in \mathcal{P}({\mathbb{R}}^{d})}{\mathrm{min}}{\int}_{{\mathbb{R}}^{d}}\left(\frac{1}{2}{u}^{\prime}{\sigma}^{\prime}\sigma u{v}_{xx}(t,x;w)+{\rho}^{\prime}\sigma u{v}_{x}(t,x;w)+\lambda \mathrm{ln}\pi (u)\right)\pi (u)\mathit{d}u=0,$$  (7) 
with the terminal condition $v(T,x;w)={(x-w)}^{2}-{(w-z)}^{2}$. Here, $\mathcal{P}\left({\mathbb{R}}^{d}\right)$ denotes the set of density functions of probability measures on ${\mathbb{R}}^{d}$ that are absolutely continuous with respect to the Lebesgue measure, and $v$ denotes the generic unknown solution to the HJB equation.
Applying the usual verification technique and using the fact that $\pi \in \mathcal{P}({\mathbb{R}}^{d})$ if and only if ${\int}_{{\mathbb{R}}^{d}}\pi (u)\mathit{d}u=1$ and $\pi (u)\ge 0$, a.e., on ${\mathbb{R}}^{d}$, we can solve the (constrained) optimization problem in the HJB equation (7) to obtain a feedback (distributional) control whose density function is given by
$$\begin{aligned}{\bm{\pi}}^{\ast}(u;t,x,w)&=\frac{\mathrm{exp}\left(-\frac{1}{\lambda}\left(\frac{1}{2}{u}^{\prime}{\sigma}^{\prime}\sigma u{v}_{xx}(t,x;w)+{\rho}^{\prime}\sigma u{v}_{x}(t,x;w)\right)\right)}{{\int}_{{\mathbb{R}}^{d}}\mathrm{exp}\left(-\frac{1}{\lambda}\left(\frac{1}{2}{u}^{\prime}{\sigma}^{\prime}\sigma u{v}_{xx}(t,x;w)+{\rho}^{\prime}\sigma u{v}_{x}(t,x;w)\right)\right)\,du}\\&=\mathcal{N}\left(u\,\Big|\,-{\sigma}^{-1}\rho \frac{{v}_{x}(t,x;w)}{{v}_{xx}(t,x;w)},\ {\left({\sigma}^{\prime}\sigma \right)}^{-1}\frac{\lambda}{{v}_{xx}(t,x;w)}\right),\end{aligned}$$  (8) 
where $\mathcal{N}(u\,|\,\beta ,\mathrm{\Sigma})$ denotes the Gaussian density function with mean vector $\beta $ and covariance matrix $\mathrm{\Sigma}$. It is assumed in (8) that ${v}_{xx}(t,x;w)>0$, which will be verified in what follows.
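As a short check of the second equality in (8), one can complete the square in $u$. Writing $\mathrm{\Sigma}={({\sigma}^{\prime}\sigma )}^{-1}\lambda /{v}_{xx}$ and $\mu =-{\sigma}^{-1}\rho \,{v}_{x}/{v}_{xx}$ (with the arguments $(t,x;w)$ suppressed), the exponent in the Gibbs form satisfies
$$-\frac{1}{\lambda}\left(\frac{1}{2}{u}^{\prime}{\sigma}^{\prime}\sigma u\,{v}_{xx}+{\rho}^{\prime}\sigma u\,{v}_{x}\right)=-\frac{1}{2}{(u-\mu )}^{\prime}{\mathrm{\Sigma}}^{-1}(u-\mu )+\frac{1}{2}{\mu}^{\prime}{\mathrm{\Sigma}}^{-1}\mu ,$$
since ${\mathrm{\Sigma}}^{-1}={\sigma}^{\prime}\sigma \,{v}_{xx}/\lambda $ and ${u}^{\prime}{\mathrm{\Sigma}}^{-1}\mu =-\frac{1}{\lambda}{\rho}^{\prime}\sigma u\,{v}_{x}$. The last term on the right does not depend on $u$ and cancels against the normalizing integral, which leaves the Gaussian density with mean $\mu $ and covariance $\mathrm{\Sigma}$. The assumption ${v}_{xx}>0$ is what makes $\mathrm{\Sigma}$ positive definite.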
Substituting the candidate optimal Gaussian feedback control policy (8) back into the HJB equation (7), the latter is transformed to
$${v}_{t}(t,x;w)-\frac{{\rho}^{\prime}\rho}{2}\frac{{v}_{x}^{2}(t,x;w)}{{v}_{xx}(t,x;w)}+\frac{\lambda}{2}\left(d-d\,\mathrm{ln}\left(\frac{2\pi e\lambda}{{v}_{xx}(t,x;w)}\right)+\mathrm{ln}\left|{\sigma}^{\prime}\sigma \right|\right)=0,$$  (9) 
with $v(T,x;w)={(x-w)}^{2}-{(w-z)}^{2}$, where $|\cdot |$ denotes the matrix determinant. A direct computation yields that this equation has a classical solution
$$v(t,x;w)={(x-w)}^{2}{e}^{-{\rho}^{\prime}\rho (T-t)}+\frac{\lambda d}{4}{\rho}^{\prime}\rho \left({T}^{2}-{t}^{2}\right)-\frac{\lambda d}{2}\left({\rho}^{\prime}\rho T-\frac{1}{d}\mathrm{ln}\frac{\left|{\sigma}^{\prime}\sigma \right|}{\pi \lambda}\right)(T-t)-{(w-z)}^{2},$$  (10) 
which clearly satisfies ${v}_{xx}(t,x;w)>0$, for any $(t,x)\in [0,T]\times \mathbb{R}$. It then follows that the candidate optimal feedback Gaussian policy (8) reduces to
$${\bm{\pi}}^{\ast}(u;t,x,w)=\mathcal{N}\left(u\,\Big|\,-{\sigma}^{-1}\rho (x-w),\ {\left({\sigma}^{\prime}\sigma \right)}^{-1}\frac{\lambda}{2}{e}^{{\rho}^{\prime}\rho (T-t)}\right),\quad (t,x)\in [0,T]\times \mathbb{R}.$$  (11) 
Finally, the optimal wealth process (4) under ${\bm{\pi}}^{\ast}$ becomes
$$d{X}_{t}^{*}=-{\rho}^{\prime}\rho ({X}_{t}^{*}-w)dt+{\left({\rho}^{\prime}\rho {\left({X}_{t}^{*}-w\right)}^{2}+\frac{\lambda}{2}{e}^{{\rho}^{\prime}\rho (T-t)}\right)}^{\frac{1}{2}}d{B}_{t},\quad {X}_{0}^{*}={x}_{0}.$$  (12) 
It has a unique strong solution for $0\le t\le T$, as can be easily verified. We now summarize the above results in the following theorem.
Theorem 1
The optimal value function of the entropy-regularized exploratory MV problem (6) is given by
$$V(t,x;w)={(x-w)}^{2}{e}^{-{\rho}^{\prime}\rho (T-t)}+\frac{\lambda d}{4}{\rho}^{\prime}\rho \left({T}^{2}-{t}^{2}\right)-\frac{\lambda d}{2}\left({\rho}^{\prime}\rho T-\frac{1}{d}\mathrm{ln}\frac{\left|{\sigma}^{\prime}\sigma \right|}{\pi \lambda}\right)(T-t)-{(w-z)}^{2},$$  (13) 
for $(t,x)\in [0,T]\times \mathbb{R}$. Moreover, the optimal feedback control is Gaussian, with its density function given by
$${\bm{\pi}}^{\ast}(u;t,x,w)=\mathcal{N}\left(u\,\Big|\,-{\sigma}^{-1}\rho (x-w),\ {\left({\sigma}^{\prime}\sigma \right)}^{-1}\frac{\lambda}{2}{e}^{{\rho}^{\prime}\rho (T-t)}\right).$$  (14) 
The associated optimal wealth process under ${\bm{\pi}}^{\ast}$ is the unique solution of the stochastic differential equation
$$d{X}_{t}^{*}=-{\rho}^{\prime}\rho ({X}_{t}^{*}-w)dt+{\left({\rho}^{\prime}\rho {\left({X}_{t}^{*}-w\right)}^{2}+\frac{\lambda}{2}{e}^{{\rho}^{\prime}\rho (T-t)}\right)}^{\frac{1}{2}}d{B}_{t},\quad {X}_{0}^{*}={x}_{0}.$$  (15) 
Finally, the Lagrange multiplier $w$ is given by $w=\frac{z{e}^{{\rho}^{\prime}\rho T}-{x}_{0}}{{e}^{{\rho}^{\prime}\rho T}-1}$.
Proof. See Appendix C.1.
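The expression for $w$ in Theorem 1 can be checked numerically. Taking expectations in (15) removes the $d{B}_{t}$ term and leaves the linear equation $d\mathbb{E}[{X}_{t}^{*}]=-{\rho}^{\prime}\rho (\mathbb{E}[{X}_{t}^{*}]-w)dt$, whose solution at $T$ is $w+({x}_{0}-w){e}^{-{\rho}^{\prime}\rho T}$. The sketch below, with hypothetical function names and an assumed value of ${\rho}^{\prime}\rho $, confirms that the stated $w$ makes this equal to the target $z$.

```python
import math


def lagrange_multiplier(z, x0, rho_sq, T):
    """w from Theorem 1, with rho_sq standing for rho' rho."""
    growth = math.exp(rho_sq * T)
    return (z * growth - x0) / (growth - 1.0)


def expected_terminal_wealth(x0, w, rho_sq, T):
    """E[X_T] from the mean dynamics of (15): w + (x0 - w) exp(-rho_sq T)."""
    return w + (x0 - w) * math.exp(-rho_sq * T)


# Assumed inputs: target z = 8 over T = 10 years, x0 = 1, rho' rho = 0.3
w = lagrange_multiplier(8.0, 1.0, 0.3, 10.0)
mean_wealth = expected_terminal_wealth(1.0, w, 0.3, 10.0)
```

Note that $w>z$ whenever $z>{x}_{0}$: the policy aims at a level above the target so that the mean of the terminal wealth lands on $z$.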
Theorem 1 indicates that the level of exploration, measured by the covariance of the Gaussian policy, ${({\sigma}^{\prime}\sigma )}^{-1}\frac{\lambda}{2}{e}^{{\rho}^{\prime}\rho (T-t)}$, decays in time. The agent initially engages in exploration at the maximum level, and reduces it gradually (although never to zero) as time approaches the end of the investment horizon. Naturally, exploitation dominates exploration as time approaches maturity. Theorem 1 presents such a decaying exploration scheme endogenously which, to the best of our knowledge, has not been derived in the RL literature.
Moreover, the mean of the Gaussian distribution (14) is independent of the exploration weight $\lambda $, while its variance is independent of the state $x$. This highlights a perfect separation between exploitation and exploration, as the former is captured by the mean and the latter by the variance of the optimal Gaussian exploration. This property is also consistent with the linear–quadratic case in the infinite horizon studied in Wang et al. [2019].
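The separation described above is easy to see when the parameters of (14) are computed explicitly. The following sketch returns the mean vector and the scalar factor multiplying ${({\sigma}^{\prime}\sigma )}^{-1}$ in the covariance; it assumes that the vector ${\sigma}^{-1}\rho $ and the scalar ${\rho}^{\prime}\rho $ have been computed beforehand, and the function name and inputs are hypothetical.

```python
import math


def optimal_policy_parameters(x, w, t, T, lam, sigma_inv_rho, rho_sq):
    """Mean vector and scalar covariance factor of the Gaussian policy (14).

    The covariance matrix is cov_scale * (sigma' sigma)^{-1}; sigma_inv_rho is
    the precomputed vector sigma^{-1} rho and rho_sq stands for rho' rho.
    """
    mean = [-(x - w) * s for s in sigma_inv_rho]             # depends on x, not on lam
    cov_scale = 0.5 * lam * math.exp(rho_sq * (T - t))       # depends on lam and t, not on x
    return mean, cov_scale


# Assumed inputs: x = 1, w = 2, T = 1, sigma^{-1} rho = (0.5, -0.2), rho' rho = 0.3
mean_a, scale_start = optimal_policy_parameters(1.0, 2.0, 0.0, 1.0, 0.1, [0.5, -0.2], 0.3)
mean_b, _ = optimal_policy_parameters(1.0, 2.0, 0.0, 1.0, 0.2, [0.5, -0.2], 0.3)
_, scale_end = optimal_policy_parameters(1.0, 2.0, 1.0, 1.0, 0.1, [0.5, -0.2], 0.3)
```

Changing $\lambda $ leaves the mean untouched, while the covariance factor shrinks from $\frac{\lambda}{2}{e}^{{\rho}^{\prime}\rho T}$ at $t=0$ to $\frac{\lambda}{2}$ at $t=T$.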
It is reasonable to expect that the exploratory problem converges to its classical counterpart as the exploration weight $\lambda $ decreases to 0. Let ${\bm{u}}^{\ast}$ be the optimal feedback control for the classical MV problem, and denote by ${V}^{\text{cl}}$ the optimal value function. Let ${\delta}_{a}(\cdot )$ be the Dirac measure centered at $a\in {\mathbb{R}}^{d}$. Then the following result holds.
Theorem 2
For each $(t,x,w)\in [0,T]\times \mathbb{R}\times \mathbb{R}$,
$$\underset{\lambda \to 0}{\mathrm{lim}}\ {\bm{\pi}}^{\ast}(\cdot ;t,x,w)={\delta}_{{\bm{u}}^{\ast}(t,x;w)}(\cdot )\quad \text{weakly.}$$ 
Moreover,
$$\underset{\lambda \to 0}{\mathrm{lim}}\left|V(t,x;w)-{V}^{\text{cl}}(t,x;w)\right|=0.$$ 
Proof. See Appendix C.2.
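The rigorous argument is in the appendix; the following heuristic computation only indicates why the result is plausible. It assumes that the classical value function is the $\lambda $-free part of (13), namely ${V}^{\text{cl}}(t,x;w)={(x-w)}^{2}{e}^{-{\rho}^{\prime}\rho (T-t)}-{(w-z)}^{2}$. Under that assumption,
$$V(t,x;w)-{V}^{\text{cl}}(t,x;w)=\frac{\lambda d}{4}{\rho}^{\prime}\rho \left({T}^{2}-{t}^{2}\right)-\frac{\lambda d}{2}{\rho}^{\prime}\rho T(T-t)+\frac{\lambda}{2}\left(\mathrm{ln}\left|{\sigma}^{\prime}\sigma \right|-\mathrm{ln}(\pi \lambda )\right)(T-t),$$
and every term tends to $0$ as $\lambda \to 0$ because $\lambda \,\mathrm{ln}\lambda \to 0$. Similarly, the covariance in (14) is proportional to $\lambda $, so the Gaussian policy concentrates on its mean $-{\sigma}^{-1}\rho (x-w)$, which does not depend on $\lambda $.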
4 RL Algorithm Design
4.1 A policy improvement theorem
We present a policy improvement theorem that is a crucial prerequisite for our interpretable RL algorithm, the EMV algorithm, which solves the exploratory MV problem in high dimensions.
Theorem 3 (Policy Improvement Theorem)
Let $w\in \mathbb{R}$ be fixed and $\bm{\pi}=\bm{\pi}(\cdot ;\cdot ,\cdot ,w)$ be an arbitrarily given admissible feedback control policy. Suppose that the corresponding value function satisfies ${V}^{\bm{\pi}}(\cdot ,\cdot ;w)\in {C}^{1,2}([0,T)\times \mathbb{R})\cap {C}^{0}([0,T]\times \mathbb{R})$ and ${V}_{xx}^{\bm{\pi}}(t,x;w)>0$, for any $(t,x)\in [0,T)\times \mathbb{R}$. Suppose further that the feedback policy $\tilde{\bm{\pi}}$ defined by
$$\tilde{\bm{\pi}}(u;t,x,w)=\mathcal{N}\left(u\,\Big|\,-{\sigma}^{-1}\rho \frac{{V}_{x}^{\bm{\pi}}(t,x;w)}{{V}_{xx}^{\bm{\pi}}(t,x;w)},\ {({\sigma}^{\prime}\sigma )}^{-1}\frac{\lambda}{{V}_{xx}^{\bm{\pi}}(t,x;w)}\right)$$  (16) 
is admissible. Then,
$${V}^{\tilde{\bm{\pi}}}(t,x;w)\le {V}^{\bm{\pi}}(t,x;w),\quad (t,x)\in [0,T]\times \mathbb{R}.$$  (17) 
Proof. See Appendix C.3.
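To illustrate the update (16), suppose, purely as an assumed example, that the value function of the current policy is quadratic in the state, ${V}^{\bm{\pi}}(t,x;w)=a(t){(x-w)}^{2}+b(t)$ with $a(t)>0$. Then ${V}_{x}^{\bm{\pi}}/{V}_{xx}^{\bm{\pi}}=x-w$ and ${V}_{xx}^{\bm{\pi}}=2a(t)$, so (16) becomes
$$\tilde{\bm{\pi}}(u;t,x,w)=\mathcal{N}\left(u\,\Big|\,-{\sigma}^{-1}\rho (x-w),\ {({\sigma}^{\prime}\sigma )}^{-1}\frac{\lambda}{2a(t)}\right).$$
In this example the mean of the improved policy already coincides with the mean of the optimal policy (14) after a single update, whatever $a(t)$ is; only the covariance still depends on the policy being improved.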
The above theorem suggests that there are always policies in the Gaussian family that improve the value function of any given, not necessarily Gaussian, policy. Moreover, the Gaussian family is closed under the policy improvement scheme. Hence, without loss of generality, we can simply focus on the Gaussian policies when choosing an initial solution. The next result shows convergence of both the value functions and the policies from a specifically parameterized Gaussian policy.
Theorem 4
Let ${\bm{\pi}}_{0}(u;t,x,w)=\mathcal{N}\left(u\,|\,\alpha (x-w),\mathrm{\Sigma}{e}^{\beta (T-t)}\right)$, with $\alpha \in {\mathbb{R}}^{d}$, $\beta \in \mathbb{R}$ and $\mathrm{\Sigma}$ being a $d\times d$ positive definite matrix. Denote by $\{{\bm{\pi}}_{n}(u;t,x,w),(t,x)\in [0,T]\times \mathbb{R},n\ge 1\}$ the sequence of feedback policies updated by the policy improvement scheme (16), and $\{{V}^{{\bm{\pi}}_{n}}(t,x;w),(t,x)\in [0,T]\times \mathbb{R},n\ge 1\}$ the sequence of the corresponding value functions. Then,
$$\underset{n\to \mathrm{\infty}}{\mathrm{lim}}\ {\bm{\pi}}_{n}(\cdot ;t,x,w)={\bm{\pi}}^{\ast}(\cdot ;t,x,w)\quad \text{weakly,}$$  (18) 
and
$$\underset{n\to \mathrm{\infty}}{\mathrm{lim}}\ {V}^{{\bm{\pi}}_{n}}(t,x;w)=V(t,x;w),$$  (19) 
for any $(t,x,w)\in [0,T]\times \mathbb{R}\times \mathbb{R}$, where ${\bm{\pi}}^{\ast}$ and $V$ are the optimal Gaussian policy (14) and the optimal value function (13), respectively.
Proof. See Appendix C.4.
4.2 The EMV algorithm
We provide the EMV algorithm to directly learn the optimal solution of the continuous-time exploratory MV problem in high dimensions within competitive training time. Theorem 3 provides guidance for policy improvement. For the policy evaluation step, we follow Doya [2000] to minimize the continuous-time Bellman’s error
$${\delta}_{t}:={\dot{V}}_{t}^{\bm{\pi}}+\lambda {\int}_{{\mathbb{R}}^{d}}{\pi}_{t}(u)\mathrm{ln}{\pi}_{t}(u)\mathit{d}u,$$  (20) 
where ${\dot{V}}_{t}^{\bm{\pi}}=\frac{{V}^{\bm{\pi}}(t+\mathrm{\Delta}t,{X}_{t+\mathrm{\Delta}t})-{V}^{\bm{\pi}}(t,{X}_{t})}{\mathrm{\Delta}t}$ is the total derivative and $\mathrm{\Delta}t$ is the discretization step for the learning algorithm. This leads to the cost function to be minimized
$$C(\theta ,\varphi )=\frac{1}{2}\sum _{({t}_{i},{x}_{i})\in \mathcal{D}}{\left({\dot{V}}^{\theta}({t}_{i},{x}_{i})+\lambda {\int}_{{\mathbb{R}}^{d}}{\pi}_{{t}_{i}}^{\varphi}(u)\mathrm{ln}{\pi}_{{t}_{i}}^{\varphi}(u)\mathit{d}u\right)}^{2}\mathrm{\Delta}t,$$  (21) 
using samples collected in the set $\mathcal{D}$ under the current Gaussian policy ${\bm{\pi}}^{\varphi}$. Here, both the value function ${V}^{\theta}$ and the Gaussian policy ${\bm{\pi}}^{\varphi}$ can be parametrized more explicitly, in view of (13) and Theorems 3 and 4. The cost function (21) can then be minimized by stochastic gradient descent. Finally, the EMV algorithm updates the Lagrange multiplier $w$ every $N$ iterations based on stochastic approximation and the constraint $\mathbb{E}[{X}_{T}^{\pi}]=z$, namely, $w\leftarrow w-\alpha \left(\frac{1}{N}{\sum}_{j}{x}_{T}^{j}-z\right)$, where the ${x}_{T}^{j}$’s are the most recent $N$ terminal wealth values. We refer the readers to Wang and Zhou [2019] for a more detailed description of the EMV algorithm in the one risky asset scenario.
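The ingredients above can be sketched in a few lines. This is not the author's implementation: the function names, the representation of a sample as a triple, and the numerical inputs are all assumptions made for illustration, and the gradient step on (21) is omitted. The code only evaluates the discretized Bellman error (20), the cost (21), and the stochastic-approximation update of $w$.

```python
def bellman_error(v_now, v_next, dt, lam, neg_entropy):
    """Discretized Bellman error (20) for one transition.

    v_now = V(t, X_t), v_next = V(t + dt, X_{t+dt});
    neg_entropy = integral of pi_t(u) ln pi_t(u) du (minus the entropy of pi_t).
    """
    return (v_next - v_now) / dt + lam * neg_entropy


def cost(samples, dt, lam):
    """Cost (21): half the sum of squared Bellman errors, times dt.

    samples is an iterable of (v_now, v_next, neg_entropy) triples (assumed layout).
    """
    return 0.5 * sum(bellman_error(a, b, dt, lam, h) ** 2 for a, b, h in samples) * dt


def update_lagrange_multiplier(w, terminal_wealths, z, alpha):
    """Stochastic-approximation step w <- w - alpha * (mean(x_T) - z)."""
    mean_terminal = sum(terminal_wealths) / len(terminal_wealths)
    return w - alpha * (mean_terminal - z)


# Assumed toy numbers: recent terminal wealths fall short of the target z = 8
w_new = update_lagrange_multiplier(8.0, [7.0, 8.0], z=8.0, alpha=0.1)
```

When the recent terminal wealths fall below the target, the update raises $w$, which in view of (14) pushes the mean of the policy toward larger risky positions.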
5 Empirical Results
5.1 Data and methods
We test the EMV algorithm on price data of the S&P $500$ stocks for both monthly and daily trading (all data are from Wharton Research Data Services (WRDS), https://wrdsweb.wharton.upenn.edu/wrds/). For the former, we train the EMV algorithm on $10$ years of monthly data from 08-31-1990 to 08-31-2000, and then test the learned allocation strategy from 09-29-2000 to 09-30-2010. The initial wealth is normalized to $1$ and the $10$-year target is $z=8$, corresponding to a $23\%$ annualized target return. In the daily rebalancing scenario, the EMV algorithm is trained on $1$ year of daily data from 01-09-2017 to 01-08-2018 and tested on the subsequent year, with a $40\%$ return set as the target for the $1$-year investment horizon.
For comparison, we also train and test alternative methods for solving the portfolio allocation problem on the same data. Specifically, we consider the classical econometric methods including Black-Litterman (BL, Black and Litterman [1992]), Fama-French (FF, Fama and French [1996]) and the Markowitz portfolio (Markowitz, Markowitz [1952]). A recently developed distributionally robust MV strategy, the Robust Wasserstein Profile Inference (RWPI, Blanchet et al. [2018]), is also included. To compare EMV with a deep RL method, we adjust DDPG similarly as in Wang and Zhou [2019], so that it can solve the classical MV problem (3). All experiments were performed on a MacBook Air laptop, with DDPG trained using TensorFlow.
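The correspondence between the terminal target and the annualized return is a one-line compounding calculation. The helper below, whose name is hypothetical, makes the arithmetic explicit for an initial wealth normalized to $1$.

```python
def annualized_return(target_multiple, years):
    """Annualized return implied by growing initial wealth 1 to target_multiple."""
    return target_multiple ** (1.0 / years) - 1.0


monthly_case = annualized_return(8.0, 10)  # 10-year target z = 8
daily_case = annualized_return(1.4, 1)     # 1-year target of a 40% return
```

For $z=8$ over $10$ years this gives ${8}^{1/10}-1\approx 0.231$, which matches the $23\%$ figure quoted above.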
5.2 Test I: monthly rebalancing
We first consider $d=20$. By randomly selecting $20$ stocks for each set/seed, we compose $100$ different seeds. The split of training and testing data for EMV and DDPG is fixed as described above, but we consider two types of training. The first training method is batch (off-line) RL, where both algorithms are trained for multiple episodes using one seed, followed by testing on the subsequent $10$ years of data of that seed. The performance is then averaged over the $100$ seeds. The second method is to use all $100$ seeds and select one seed randomly for each episode during training. Both algorithms are then tested on $100$ randomly selected seeds over the test period and the performance is averaged as well. The second method can be seen as artificially generating randomness for training and testing, and an algorithm that performs well under this method has universality and the potential to generalize to data of stocks in different sectors.
For competitive performance, we adopt rolling-horizon based training and testing for all the other methods. Specifically, each time after the one-month-ahead investment decision is made on the test set, we add the most recent price data point from the test set to the training set, and discard the oldest data point from the training set.
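The seed construction and the random choice of a seed per training episode can be sketched as follows. The ticker names are hypothetical placeholders, and the function names are illustrative; the paper does not specify how the sampling was implemented.

```python
import random


def compose_seeds(tickers, n_seeds=100, d=20, rng=None):
    """Draw n_seeds stock sets, each holding d distinct tickers."""
    rng = rng or random.Random()
    return [rng.sample(tickers, d) for _ in range(n_seeds)]


def pick_training_seed(seeds, rng):
    """Second training method: choose one stock set at random per episode."""
    return rng.choice(seeds)


# Hypothetical placeholder universe; real tickers would come from the data set.
universe = [f"STOCK{i:03d}" for i in range(500)]
seeds = compose_seeds(universe, rng=random.Random(0))
episode_seed = pick_training_seed(seeds, random.Random(1))
```

In the batch method each element of `seeds` would be used on its own for training and testing, whereas in the second method `pick_training_seed` is called once per episode.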
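The rolling-horizon update amounts to a fixed-length sliding window over the price history. A minimal sketch, assuming the training data for one asset are held in a bounded `collections.deque` (the window length and prices below are toy values):

```python
from collections import deque


def rolling_update(train_window, new_point):
    """Add the newest price point; a full deque(maxlen=n) drops its oldest."""
    train_window.append(new_point)
    return train_window


# Toy example with a window of 3 observations (real windows are much longer).
window = deque([100.0, 101.0, 102.0], maxlen=3)
rolling_update(window, 103.0)
```

After the update the window holds the three most recent observations, so the training set keeps a constant size as testing proceeds.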
Figure 0(a) shows the performance of various investment strategies, including variants of the EMV algorithm with different gross leverage constraints on portfolios. (Leverage is a fundamental investment tool for most hedge funds; according to Ang et al. [2011], the average gross leverage across the $208$ hedge funds studied therein is $213\%$.) Under a reasonable leverage constraint, the EMV algorithm still outperforms most other methods (which have no constraints, except DDPG) by a large margin, although it was trained using only the previous $10$ years of monthly data.
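The text does not say how the gross leverage constraint is enforced. One common way, shown here only as an assumption and not necessarily the author's, is to rescale the dollar positions proportionally whenever the sum of their absolute values exceeds the allowed multiple of current wealth. The function name is hypothetical.

```python
def apply_gross_leverage_cap(u, wealth, max_leverage):
    """Scale dollar positions u so that sum(|u_i|) <= max_leverage * |wealth|."""
    gross = sum(abs(p) for p in u)
    cap = max_leverage * abs(wealth)
    if gross <= cap or gross == 0.0:
        return list(u)  # constraint already satisfied
    scale = cap / gross
    return [p * scale for p in u]


# Assumed example: positions of 3 long and 1 short on wealth 1, cap at 200%
capped = apply_gross_leverage_cap([3.0, -1.0], wealth=1.0, max_leverage=2.0)
```

Proportional scaling preserves the relative weights and the long/short direction of each position, which keeps the constrained portfolio as close as possible in shape to the unconstrained allocation.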
The universal training and testing method was used for EMV and DDPG in Figure 0(a). Results for the batch method can be found in Appendix D. A remarkable fact in both cases is that the original EMV algorithm, devised to solve the exploratory MV problem (6) without constraint, achieves the target $z=8$ with minimal variance for most of the test period. We also report various investment outcomes in Table 1 when scaling up $d$, the number of stocks in the portfolio.
5.3 Test II: daily rebalancing
For daily trading with $d=50$, we report the performance of the EMV algorithm under different gross leverage constraints in Figure 0(b). The DDPG algorithm was not competitive in the daily trading setting (see Table 1) and, hence, omitted. For different $d$, Table 1 summarizes the investment outcomes and the training time (per experiment). These results were obtained using the universal method for both training and testing.


Figure: (a) $10$-year horizon with monthly rebalancing and (b) $1$-year horizon with daily rebalancing.
6 Related Work
The difficulty of seeking the global optimum for Markov Decision Process (MDP) problems under the MV criterion has been previously noted in Mannor and Tsitsiklis [2013]. In fact, the variance of reward-to-go is nonlinear in expectation and, as a result of Bellman’s inconsistency, most of the well-known RL algorithms cannot be applied directly.
Existing works on variance estimation and control generally divide into value-based methods and policy-based methods. Sobel [1982] obtained the Bellman’s equation for the variance of reward-to-go under a fixed, given policy. Sato et al. [2001] further derived the TD(0) learning rule to estimate the variance, followed by Sato and Kobayashi [2000], which applied this value-based method to an MV portfolio selection problem. It is worth noting that, due to the definition of the value function (i.e., the variance-penalized expected reward-to-go) in Sato and Kobayashi [2000], Bellman’s optimality principle does not hold. As a result, it is not guaranteed that a greedy policy based on the latest updated value function will eventually lead to the true global optimal policy. The second approach, policy-based RL, was proposed in Tamar et al. [2013]. They also extended the work to linear function approximators and devised actor-critic algorithms for MV optimization problems for which convergence to a local optimum is guaranteed with probability one (Tamar and Mannor [2013]). Related works following this line of research include Prashanth and Ghavamzadeh [2013], Prashanth and Ghavamzadeh [2016], among others. Despite the various methods mentioned above, it remains an open and interesting question in RL to search for the global optimum under the MV criterion.
In this paper, rather than relying on the typical framework of discrete-time MDP and discretizing time and state/action spaces accordingly, we designed the EMV algorithm to learn the global optimal solution of the continuous-time exploratory MV problem (6) directly. As pointed out in Doya [2000], it is typically challenging to find the right granularity to discretize the state and action spaces, and naive discretization may lead to poor performance. On the other hand, grid-based discretization methods for solving the HJB equation cannot easily extend to high dimensions in practice due to the curse of dimensionality, although theoretical convergence results have been established (see Munos and Bourgine [1998], Munos [2000]). Our EMV algorithm, however, is computationally feasible and implementable in high dimensions, as demonstrated by the experiments, because the explicit representations of the value functions and the portfolio strategies free it from the curse of dimensionality. Note that our algorithm does not use (deep) neural networks, which have been applied extensively in the literature for (high-dimensional) continuous RL problems (e.g., Lillicrap et al. [2016], Mnih et al. [2015]) but are known for unstable performance, sample inefficiency and extensive hyperparameter tuning (Mnih et al. [2015], Duan et al. [2016], Henderson et al. [2018]), in addition to their low interpretability. (Interpretability is one of the most important and pressing issues in general artificial intelligence applications in the financial industry due to, among others, regulatory requirements.)
7 Conclusions
We studied the continuous-time mean-variance (MV) portfolio allocation problem in high dimensions using RL methods. Under the exploratory control framework for general continuous-time optimization problems, we formulated the exploratory MV problem in high dimensions and proved the optimality of a Gaussian policy in achieving the best tradeoff between exploration and exploitation. Our EMV algorithm, designed by combining quantitative finance analysis and RL techniques to solve the exploratory MV problem, is interpretable, scalable and data efficient, thanks to a provable policy improvement theorem and efficient functional approximations based on the theoretical optimal solutions. It consistently outperforms both classical model-based econometric methods and a model-free deep RL method across different training and testing scenarios. Interesting future research includes testing the EMV algorithm for shorter trading horizons with tick data (e.g., high-frequency trading), or for other financial applications such as mean-variance option hedging.
Acknowledgments
The author would like to thank Prof. Xun Yu Zhou for generous support and continuing encouragement on this work. The author also wants to thank Lin (Charles) Chen for providing the results on BL, FF, Markowitz and RWPI methods.
References
 Ang et al. [2011] Andrew Ang, Sergiy Gorovyy, and Gregory B Van Inwegen. Hedge fund leverage. Journal of Financial Economics, 102(1):102–126, 2011.
 Black and Litterman [1992] Fischer Black and Robert Litterman. Global portfolio optimization. Financial Analysts Journal, 48(5):28–43, 1992.
 Blanchet et al. [2018] Jose Blanchet, Lin Chen, and Xun Yu Zhou. Distributionally robust mean-variance portfolio selection with Wasserstein distances. arXiv preprint arXiv:1802.04885, 2018.
 Buehler et al. [2019] Hans Buehler, Lukas Gonon, Josef Teichmann, and Ben Wood. Deep hedging. Quantitative Finance, pages 1–21, 2019.
 Doya [2000] Kenji Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000.
 Duan et al. [2016] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
 Fama and French [1996] Eugene F Fama and Kenneth R French. Multifactor explanations of asset pricing anomalies. The Journal of Finance, 51(1):55–84, 1996.
 Henderson et al. [2018] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Levine et al. [2016] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 Lillicrap et al. [2016] Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
 Luenberger [1998] David G Luenberger. Investment Science. Oxford University Press, New York, 1998.
 Mannor and Tsitsiklis [2013] Shie Mannor and John N Tsitsiklis. Algorithmic aspects of mean–variance optimization in Markov decision processes. European Journal of Operational Research, 231(3):645–653, 2013.
 Markowitz [1952] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.
 Markowitz [1959] Harry Markowitz. Portfolio Selection: Efficient Diversification of Investments. Yale University Press, 1959. ISBN 9780300013726. URL http://www.jstor.org/stable/j.ctt1bh4c8h.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, and Georg Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Moody and Saffell [2001] John Moody and Matthew Saffell. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875–889, 2001.
 Moody et al. [1998] John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17(5-6):441–470, 1998.
 Munos [2000] Rémi Munos. A study of reinforcement learning in the continuous case by the means of viscosity solutions. Machine Learning, 40(3):265–299, 2000.
 Munos and Bourgine [1998] Rémi Munos and Paul Bourgine. Reinforcement learning for continuous stochastic control problems. In Advances in Neural Information Processing Systems, pages 1029–1035, 1998.
 Nevmyvaka et al. [2006] Yuriy Nevmyvaka, Yi Feng, and Michael Kearns. Reinforcement learning for optimized trade execution. In Proceedings of the 23rd International Conference on Machine Learning, pages 673–680, 2006.
 Peters et al. [2003] Jan Peters, Sethu Vijayakumar, and Stefan Schaal. Reinforcement learning for humanoid robotics. In Proceedings of the 3rd IEEE-RAS International Conference on Humanoid Robots, pages 1–20, 2003.
 Prashanth and Ghavamzadeh [2013] LA Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems, pages 252–260, 2013.
 Prashanth and Ghavamzadeh [2016] LA Prashanth and Mohammad Ghavamzadeh. Variance-constrained actor-critic algorithms for discounted and average reward MDPs. Machine Learning, 105(3):367–417, 2016.
 Sato and Kobayashi [2000] Makoto Sato and Shigenobu Kobayashi. Variance-penalized reinforcement learning for risk-averse asset allocation. In International Conference on Intelligent Data Engineering and Automated Learning, pages 244–249. Springer, 2000.
 Sato et al. [2001] Makoto Sato, Hajime Kimura, and Shigenobu Kobayashi. TD algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence, 16(3):353–362, 2001.
 Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
 Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, and Adrian Bolton. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
 Sobel [1982] Matthew J Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802, 1982.
 Tamar and Mannor [2013] Aviv Tamar and Shie Mannor. Variance adjusted actor critic algorithms. arXiv preprint arXiv:1310.3697, 2013.
 Tamar et al. [2013] Aviv Tamar, Dotan Di Castro, and Shie Mannor. Temporal difference methods for the variance of the reward to go. In International Conference on Machine Learning, pages 495–503, 2013.
 Wang and Zhou [2019] Haoran Wang and Xun Yu Zhou. Continuous-time mean-variance portfolio selection: A reinforcement learning framework. arXiv preprint arXiv:1904.11392, 2019.
 Wang et al. [2019] Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou. Exploration versus exploitation in reinforcement learning: A stochastic control approach. arXiv preprint arXiv:1812.01552v3, 2019.
 Zhou and Li [2000] Xun Yu Zhou and Duan Li. Continuous-time mean-variance portfolio selection: A stochastic LQ framework. Applied Mathematics and Optimization, 42(1):19–33, 2000.
Appendix A Controlled Wealth Dynamics
Let ${W}_{t}=({W}_{t}^{1},\mathrm{\dots},{W}_{t}^{d})$, $0\le t\le T$, be a standard $d$-dimensional Brownian motion defined on a filtered probability space $(\mathrm{\Omega},\mathcal{F},{\{{\mathcal{F}}_{t}\}}_{0\le t\le T},\mathbb{P})$ that satisfies the usual conditions. The price process of the $i$-th risky asset is a geometric Brownian motion governed by
$$d{S}_{t}^{i}={S}_{t}^{i}\left({\mu}^{i}dt+{\sigma}^{i}\cdot d{W}_{t}\right),0\le t\le T,i=1,\mathrm{\dots},d,$$  (22) 
with ${S}_{0}^{i}={s}_{0}^{i}>0$ being the initial price at $t=0$, and ${\mu}^{i}\in \mathbb{R}$, ${\sigma}^{i}=({\sigma}^{1i},\mathrm{\dots},{\sigma}^{di})\in {\mathbb{R}}^{d}$ being the mean return and volatility coefficients of the $i$-th risky asset, respectively. We denote for brevity the mean return vector by $\mu \in {\mathbb{R}}^{d}$, and the volatility matrix by $\sigma \in {\mathbb{R}}^{d\times d}$, whose $i$-th column represents the volatility ${\sigma}^{i}$ of the $i$-th risky asset. The riskless asset has a constant interest rate $r>0$. We assume that $\sigma $ is non-degenerate and hence there exists a $d$-dimensional vector $\rho $ that satisfies ${\sigma}^{\prime}\rho =\mu -r\mathbf{1}$, where $\mathbf{1}$ is the $d$-dimensional vector with all components being $1$. The vector $\rho $ is known as the market price of risk. It is worth noting that the above assumptions are made only for the convenience of deriving theoretical results in the paper; in practice, all the model parameters are unknown and time-varying, and it is the goal of RL algorithms to directly output trading strategies without relying on estimation of any underlying parameters.
Denote by ${u}_{t}^{0}$ and ${u}_{t}=({u}_{t}^{1},\mathrm{\dots},{u}_{t}^{d})$ the discounted dollar value put in the savings account and the $d$ risky assets, respectively, at time $t$. It then follows that the discounted wealth process is ${x}_{t}^{u}={\sum}_{i=0}^{d}{u}_{t}^{i}$, $0\le t\le T$. Using (22), the self-financing condition further implies that
$$d{x}_{t}^{u}=r{u}_{t}^{0}dt+\sum _{i=1}^{d}\frac{{u}_{t}^{i}}{{S}_{t}^{i}}d{S}_{t}^{i}-r{x}_{t}^{u}dt=-r({x}_{t}^{u}-{u}_{t}^{0})dt+\sum _{i=1}^{d}{u}_{t}^{i}\left({\mu}^{i}dt+{\sigma}^{i}\cdot d{W}_{t}\right)$$ 
$$=\sum _{i=1}^{d}{u}_{t}^{i}\left(({\mu}^{i}-r)dt+{\sigma}^{i}\cdot d{W}_{t}\right)=\sigma {u}_{t}\cdot (\rho dt+d{W}_{t}).$$ 
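The last equality above can be checked numerically. The following is a minimal sketch for a hypothetical two-asset market: the values of `r`, `mu`, `sigma` and `u` are illustrative assumptions, not data from the paper, and `solve_2x2` is a helper introduced only for this example. It solves ${\sigma}^{\prime}\rho =\mu -r\mathbf{1}$ for the market price of risk and compares the drift written asset by asset with the compact form $\sigma u\cdot \rho$.

```python
# Illustrative two-asset market; every number here is an assumption.
r = 0.02
mu = [0.08, 0.05]            # mean returns mu^i
sigma = [[0.20, 0.05],       # volatility matrix; column i is sigma^i
         [0.00, 0.15]]


def solve_2x2(a, b):
    """Solve the linear system a x = b for a 2x2 matrix a (Cramer's rule)."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


# Market price of risk rho, defined by sigma' rho = mu - r 1.
sigma_t = [[sigma[0][0], sigma[1][0]],
           [sigma[0][1], sigma[1][1]]]
rho = solve_2x2(sigma_t, [m - r for m in mu])

u = [3.0, -1.0]              # dollar amounts held in the two risky assets

# Drift written asset by asset: sum_i u^i (mu^i - r).
drift_assets = sum(ui * (mi - r) for ui, mi in zip(u, mu))

# Drift in compact form: (sigma u) . rho.
sigma_u = [sigma[0][0] * u[0] + sigma[0][1] * u[1],
           sigma[1][0] * u[0] + sigma[1][1] * u[1]]
drift_compact = sum(a * b for a, b in zip(sigma_u, rho))

print(rho, drift_assets, drift_compact)
```

The two drift values agree up to floating-point error, which is the identity ${u}^{\prime}(\mu -r\mathbf{1})={u}^{\prime}{\sigma}^{\prime}\rho $ used in the derivation.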
Appendix B Value Functions and Admissible Control Distributions
In order to rigorously solve (6) by dynamic programming, we need to define the value functions. For each $(s,y)\in [0,T)\times \mathbb{R}$, consider the state equation (4) on $[s,T]$ with ${X}_{s}^{\pi}=y$. Define the set of admissible controls, $\mathcal{A}(s,y)$, as follows. Let $\mathcal{B}({\mathbb{R}}^{d})$ be the Borel algebra on ${\mathbb{R}}^{d}$. A (distributional) control (or portfolio/strategy) process $\pi =\{{\pi}_{t},s\le t\le T\}$ belongs to $\mathcal{A}(s,y)$, if
(i) for each $s\le t\le T$, ${\pi}_{t}\in \mathcal{P}({\mathbb{R}}^{d})$ a.s.;
(ii) for each $A\in \mathcal{B}({\mathbb{R}}^{d})$, $\{{\int}_{A}{\pi}_{t}(u)\mathit{d}u,s\le t\le T\}$ is ${\mathcal{F}}_{t}$-progressively measurable;
(iii) $$;
(iv) $$.
Clearly, it follows from condition (iii) that the stochastic differential equation (SDE) (4) has a unique strong solution for $s\le t\le T$ that satisfies ${X}_{s}^{\pi}=y$.
Controls in $\mathcal{A}(s,y)$ are measure-valued (or, precisely, density-function-valued) stochastic processes, which are also called open-loop controls in the control terminology. As in classical control theory, it is important to distinguish between open-loop controls and feedback (or closed-loop) controls (or policies as in the RL literature, or laws as in the control literature). Specifically, a deterministic mapping $\bm{\pi}(\cdot ;\cdot ,\cdot )$ is called an (admissible) feedback control if i) $\bm{\pi}(\cdot ;t,x)$ is a density function for each $(t,x)\in [0,T]\times \mathbb{R}$; ii) for each $(s,y)\in [0,T)\times \mathbb{R}$, the following SDE (which is the system dynamics after the feedback policy $\bm{\pi}(\cdot ;\cdot ,\cdot )$ is applied)
$$d{X}_{t}^{\bm{\pi}}=\left({\int}_{{\mathbb{R}}^{d}}{\rho}^{\prime}\sigma u\,\bm{\pi}(u;t,{X}_{t}^{\bm{\pi}})du\right)dt+{\left({\int}_{{\mathbb{R}}^{d}}{u}^{\prime}{\sigma}^{\prime}\sigma u\,\bm{\pi}(u;t,{X}_{t}^{\bm{\pi}})du\right)}^{\frac{1}{2}}d{B}_{t},\quad {X}_{s}^{\bm{\pi}}=y,$$  (23) 
has a unique strong solution $\{{X}_{t}^{\bm{\pi}},t\in [s,T]\}$, and the open-loop control $\pi =\{{\pi}_{t},t\in [s,T]\}\in \mathcal{A}(s,y)$, where ${\pi}_{t}:=\bm{\pi}(\cdot ;t,{X}_{t}^{\bm{\pi}})$. In this case, the open-loop control $\pi $ is said to be generated from the feedback policy $\bm{\pi}(\cdot ;\cdot ,\cdot )$ with respect to the initial time and state, $(s,y)$. It is useful to note that an open-loop control and its admissibility depend on the initial $(s,y)$, whereas a feedback policy can generate open-loop controls for any $(s,y)\in [0,T)\times \mathbb{R}$, and hence is in itself independent of $(s,y)$. Note that throughout this paper, we have used boldfaced $\bm{\pi}$ to denote feedback controls, and the normal style $\pi $ to denote open-loop controls.
Now, for a fixed $w\in \mathbb{R}$, define
$$V(s,y;w):=\underset{\pi \in \mathcal{A}(s,y)}{\inf}\mathbb{E}\left[{({X}_{T}^{\pi}-w)}^{2}+\lambda {\int}_{s}^{T}{\int}_{{\mathbb{R}}^{d}}{\pi}_{t}(u)\ln {\pi}_{t}(u)\,du\,dt\,\Big|\,{X}_{s}^{\pi}=y\right]-{(w-z)}^{2},$$  (24) 
for $(s,y)\in [0,T)\times \mathbb{R}$. The function $V(\cdot ,\cdot ;w)$ is called the optimal value function of the problem.
Moreover, we define the value function under any given feedback control $\bm{\pi}$:
$${V}^{\bm{\pi}}(s,y;w)=\mathbb{E}\left[{({X}_{T}^{\bm{\pi}}-w)}^{2}+\lambda {\int}_{s}^{T}{\int}_{{\mathbb{R}}^{d}}{\pi}_{t}(u)\ln {\pi}_{t}(u)\,du\,dt\,\Big|\,{X}_{s}^{\bm{\pi}}=y\right]-{(w-z)}^{2},$$  (25) 
for $(s,y)\in [0,T)\times \mathbb{R}$, where $\pi =\{{\pi}_{t},$ $t\in [s,T]\}$ is the openloop control generated from $\bm{\pi}$ with respect to $(s,y)$ and $\{{X}_{t}^{\bm{\pi}},t\in [s,T]\}$ is the corresponding wealth process.
Note that in the control literature, $V$ given by (24) is called the value function. However, in the RL literature the term “value function” is also used for the objective value under a particular control (i.e., ${V}^{\bm{\pi}}$ in (25)). So, to avoid ambiguity, we have called $V$ the optimal value function in this paper.
Appendix C Proofs
C.1 Proof of Theorem 1
The main part of the proof of Theorem 1 consists of verification arguments showing that the optimal value function of problem (6) is given by (13) and that the candidate optimal policy (14) is indeed admissible, based on the definitions in Appendix B. Since the current exploratory MV problem is a special case of the exploratory linear-quadratic problem extensively studied in Wang et al. [2019], a detailed proof would follow the same lines as that of Theorem $4$ therein, and is left to interested readers.
Proof. We now determine the Lagrange multiplier $w$ through the constraint $\mathbb{E}[{X}_{T}^{*}]=z$. It follows from (15), along with the standard estimate that $$ and Fubini’s Theorem, that
$$\mathbb{E}[{X}_{t}^{*}]={x}_{0}+\mathbb{E}\left[{\int}_{0}^{t}-{\rho}^{\prime}\rho ({X}_{s}^{*}-w)ds\right]={x}_{0}+{\int}_{0}^{t}-{\rho}^{\prime}\rho \left(\mathbb{E}[{X}_{s}^{*}]-w\right)ds.$$ 
Hence, $\mathbb{E}[{X}_{t}^{*}]=({x}_{0}-w){e}^{-{\rho}^{\prime}\rho t}+w$. The constraint $\mathbb{E}[{X}_{T}^{*}]=z$ now becomes $({x}_{0}-w){e}^{-{\rho}^{\prime}\rho T}+w=z$, which gives $w=\frac{z{e}^{{\rho}^{\prime}\rho T}-{x}_{0}}{{e}^{{\rho}^{\prime}\rho T}-1}$.
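As a quick sanity check of the closed form for $w$, the sketch below evaluates it for illustrative inputs and confirms that the mean wealth curve starts at ${x}_{0}$ and ends at the target $z$. The function names `lagrange_multiplier` and `mean_wealth`, and the values of `x0`, `z`, `rho_sq` (standing for ${\rho}^{\prime}\rho $) and `T`, are assumptions made for this example only.

```python
import math


def lagrange_multiplier(x0, z, rho_sq, T):
    """w = (z * exp(rho'rho T) - x0) / (exp(rho'rho T) - 1)."""
    growth = math.exp(rho_sq * T)
    return (z * growth - x0) / (growth - 1.0)


def mean_wealth(t, x0, w, rho_sq):
    """E[X_t^*] = (x0 - w) * exp(-rho'rho t) + w."""
    return (x0 - w) * math.exp(-rho_sq * t) + w


# Illustrative inputs: initial wealth 1, target terminal mean 1.4, one year.
x0, z, rho_sq, T = 1.0, 1.4, 0.1, 1.0
w = lagrange_multiplier(x0, z, rho_sq, T)
initial_mean = mean_wealth(0.0, x0, w, rho_sq)
terminal_mean = mean_wealth(T, x0, w, rho_sq)
print(w, initial_mean, terminal_mean)
```

Note that $w$ exceeds $z$ whenever $z>{x}_{0}$: the mean wealth is pulled toward $w$ at rate ${\rho}^{\prime}\rho $ but only reaches $z$ by time $T$.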
C.2 Proof of Theorem 2
To prove the solution of the exploratory MV problem converges to that of the classical MV problem, as $\lambda \to 0$, we first recall the solution of the classical MV problem.
In order to apply dynamic programming for (3), we again consider the set of admissible controls, ${\mathcal{A}}^{\text{cl}}(s,y)$, for $(s,y)\in [0,T)\times \mathbb{R}$,
${\mathcal{A}}^{\text{cl}}(s,y):=\{u=\{{u}_{t},t\in [s,T]\}$: $u$ is ${\mathcal{F}}_{t}$-progressively measurable and $$
The (optimal) value function is defined by
$${V}^{\text{cl}}(s,y;w):=\underset{u\in {\mathcal{A}}^{\text{cl}}(s,y)}{\inf}\mathbb{E}\left[{({x}_{T}^{u}-w)}^{2}\,\Big|\,{x}_{s}^{u}=y\right]-{(w-z)}^{2},$$  (26) 
for $(s,y)\in [0,T)\times \mathbb{R}$, where $w\in \mathbb{R}$ is fixed. Once this problem is solved, $w$ can be determined by the constraint $\mathbb{E}[{x}_{T}^{*}]=z$, with $\{{x}_{t}^{*},t\in [0,T]\}$ being the optimal wealth process under the optimal portfolio ${u}^{*}$.
The HJB equation is
$${\omega}_{t}(t,x;w)+\underset{u\in {\mathbb{R}}^{d}}{\mathrm{min}}\left(\frac{1}{2}{u}^{\prime}{\sigma}^{\prime}\sigma u{\omega}_{xx}(t,x;w)+{\rho}^{\prime}\sigma u{\omega}_{x}(t,x;w)\right)=0,(t,x)\in [0,T)\times \mathbb{R},$$  (27) 
with the terminal condition $\omega (T,x;w)={(x-w)}^{2}-{(w-z)}^{2}$.
Standard verification arguments deduce the optimal value function to be
$${V}^{\text{cl}}(t,x;w)={(x-w)}^{2}{e}^{-{\rho}^{\prime}\rho (T-t)}-{(w-z)}^{2},$$  (28) 
the optimal feedback control policy to be
$${\bm{u}}^{\ast}(t,x;w)=-{\sigma}^{-1}\rho (x-w),$$  (29) 
and the corresponding optimal wealth process to be the unique strong solution to the SDE
$$d{x}_{t}^{*}=-{\rho}^{\prime}\rho ({x}_{t}^{*}-w)dt-\rho ({x}_{t}^{*}-w)\cdot d{W}_{t},\quad {x}_{0}^{*}={x}_{0}.$$  (30) 
Comparing the optimal wealth dynamics, (15) and (30), of the exploratory and classical problems, we note that they have the same drift coefficient (but different diffusion coefficients). As a result, the two problems have the same mean of optimal terminal wealth and hence the same value of the Lagrange multiplier $w=\frac{z{e}^{{\rho}^{\prime}\rho T}-{x}_{0}}{{e}^{{\rho}^{\prime}\rho T}-1}$ determined by the constraint $\mathbb{E}[{x}_{T}^{*}]=z$.
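The claim that the classical optimal wealth process attains the target mean can be illustrated by simulation. Below is a minimal Euler-Maruyama sketch of the SDE (30) in one dimension ($d=1$, scalar $\rho $); it is not the paper's EMV algorithm, and the function name, step count, path count and parameter values are all assumptions chosen for illustration. The sample mean of terminal wealth should land close to $z$, up to Monte Carlo and discretization error.

```python
import math
import random


def simulate_mean_terminal_wealth(x0, z, rho, T, n_steps=50, n_paths=10000, seed=0):
    """Euler-Maruyama simulation of dx = -rho^2 (x - w) dt - rho (x - w) dW (d = 1)."""
    rng = random.Random(seed)
    growth = math.exp(rho * rho * T)
    w = (z * growth - x0) / (growth - 1.0)   # Lagrange multiplier from the constraint
    dt = T / n_steps
    sqrt_dt = math.sqrt(dt)
    total = 0.0
    for _ in range(n_paths):
        x = x0
        for _ in range(n_steps):
            dW = rng.gauss(0.0, sqrt_dt)     # Brownian increment over one step
            gap = x - w
            x += -rho * rho * gap * dt - rho * gap * dW
        total += x
    return total / n_paths, w


# Illustrative inputs only.
sample_mean, w = simulate_mean_terminal_wealth(x0=1.0, z=1.4, rho=1.0, T=1.0)
print(sample_mean, w)
```

Because the drift is the same in (15) and (30), replacing the diffusion term with that of the exploratory dynamics would change the spread of terminal wealth but not its mean.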
Proof. The weak convergence of the feedback controls follows from the explicit forms of ${\bm{\pi}}^{\ast}$ in (14) and ${\bm{u}}^{\ast}$ in (29). The pointwise convergence of the value functions follows easily from the forms of $V$ in (13) and ${V}^{\text{cl}}$ in (28), together with the fact that
$$\underset{\lambda \to 0}{\lim}\frac{\lambda}{2}\ln \frac{|{\sigma}^{\prime}\sigma |}{\pi \lambda}=0.$$ 
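For completeness, one way to see this limit is the following short calculation, in which $c:=|{\sigma}^{\prime}\sigma |/\pi >0$ is a shorthand introduced here (reading $|\cdot |$ as the determinant, which is positive because $\sigma $ is non-degenerate):

$$\frac{\lambda}{2}\ln \frac{c}{\lambda}=\frac{\lambda}{2}\ln c-\frac{\lambda}{2}\ln \lambda \to 0\quad \text{as }\lambda \to {0}^{+},$$

since the first term is linear in $\lambda $ and $\lambda \ln \lambda \to 0$ as $\lambda \to {0}^{+}$.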
C.3 Proof of Theorem 3
Proof. Fix $(t,x)\in [0,T]\times \mathbb{R}$. Since, by assumption, the feedback policy $\stackrel{~}{\bm{\pi}}$ is admissible, the open-loop control strategy, $\stackrel{~}{\pi}=\{{\stackrel{~}{\pi}}_{v},v\in [t,T]\}$, generated from $\stackrel{~}{\bm{\pi}}$ with respect to the initial condition ${X}_{t}^{\stackrel{~}{\bm{\pi}}}=x$ is admissible. Let $\{{X}_{s}^{\stackrel{~}{\bm{\pi}}},s\in [t,T]\}$ be the corresponding wealth process under $\stackrel{~}{\pi}$. Applying Itô’s formula, we have
$${V}^{\bm{\pi}}(s,{X}_{s}^{\stackrel{~}{\bm{\pi}}})={V}^{\bm{\pi}}(t,x)+{\int}_{t}^{s}{V}_{t}^{\bm{\pi}}(v,{X}_{v}^{\stackrel{~}{\bm{\pi}}})dv+{\int}_{t}^{s}{\int}_{{\mathbb{R}}^{d}}\Big(\frac{1}{2}{u}^{\prime}{\sigma}^{\prime}\sigma u{V}_{xx}^{\bm{\pi}}(v,{X}_{v}^{\stackrel{~}{\bm{\pi}}})$$ 
$$+{\rho}^{\prime}\sigma u{V}_{x}^{\bm{\pi}}(v,{X}_{v}^{\stackrel{~}{\bm{\pi}}})\Big){\stackrel{~}{\pi}}_{v}(u)dudv+{\int}_{t}^{s}{\left({\int}_{{\mathbb{R}}^{d}}{u}^{\prime}{\sigma}^{\prime}\sigma u{\stackrel{~}{\pi}}_{v}(u)du\right)}^{\frac{1}{2}}{V}_{x}^{\bm{\pi}}(v,{X}_{v}^{\stackrel{~}{\bm{\pi}}})d{B}_{v},\quad s\in [t,T].$$  (31) 
Define the stopping times ${\tau}_{n}:=\inf \{s\ge t:{\int}_{t}^{s}{\int}_{{\mathbb{R}}^{d}}{u}^{\prime}{\sigma}^{\prime}\sigma u{\stackrel{~}{\pi}}_{v}(u)\mathit{d}u{\left({V}_{x}^{\bm{\pi}}(v,{X}_{v}^{\stackrel{~}{\bm{\pi}}})\right)}^{2}\mathit{d}v\ge n\}$, for $n\ge 1$. Then, from (31), we obtain
$${V}^{\bm{\pi}}(t,x)=\mathbb{E}\Big[{V}^{\bm{\pi}}(s\wedge {\tau}_{n},{X}_{s\wedge {\tau}_{n}}^{\stackrel{~}{\bm{\pi}}})-{\int}_{t}^{s\wedge {\tau}_{n}}{V}_{t}^{\bm{\pi}}(v,{X}_{v}^{\stackrel{~}{\bm{\pi}}})dv$$ 
$$-{\int}_{t}^{s\wedge {\tau}_{n}}{\int}_{{\mathbb{R}}^{d}}\Big(\frac{1}{2}{u}^{\prime}{\sigma}^{\prime}\sigma u{V}_{xx}^{\bm{\pi}}(v,{X}_{v}^{\stackrel{~}{\bm{\pi}}})+{\rho}^{\prime}\sigma u{V}_{x}^{\bm{\pi}}(v,{X}_{v}^{\stackrel{~}{\bm{\pi}}})\Big){\stackrel{~}{\pi}}_{v}(u)dudv\,\Big|\,{X}_{t}^{\stackrel{~}{\bm{\pi}}}=x\Big].$$  (32) 
On the other hand, by standard arguments and the assumption that ${V}^{\bm{\pi}}$ is smooth, we have
$${V}_{t}^{\bm{\pi}}(t,x)+{\int}_{{\mathbb{R}}^{d}}\left(\frac{1}{2}{u}^{\prime}{\sigma}^{\prime}\sigma u{V}_{xx}^{\bm{\pi}}(t,x)+{\rho}^{\prime}\sigma u{V}_{x}^{\bm{\pi}}(t,x)+\lambda \mathrm{ln}\bm{\pi}(u;t,x)\right)\bm{\pi}(u;t,x)\mathit{d}u=0,$$ 
for any $(t,x)\in [0,T)\times \mathbb{R}$. It follows that
$${V}_{t}^{\bm{\pi}}(t,x)+\underset{\widehat{\pi}\in \mathcal{P}({\mathbb{R}}^{d})}{\mathrm{min}}{\int}_{{\mathbb{R}}^{d}}\left(\frac{1}{2}{u}^{\prime}{\sigma}^{\prime}\sigma u{V}_{xx}^{\bm{\pi}}(t,x)+{\rho}^{\prime}\sigma u{V}_{x}^{\bm{\pi}}(t,x)+\lambda \mathrm{ln}\widehat{\pi}(u)\right)\widehat{\pi}(u)\mathit{d}u\le 0.$$  (33) 
Notice that the minimizer of the Hamiltonian in (33) is given by the feedback policy $\stackrel{~}{\bm{\pi}}$ in (16). It then follows that equation (32) implies
$${V}^{\bm{\pi}}(t,x)\ge \mathbb{E}\Big[{V}^{\bm{\pi}}(s\wedge {\tau}_{n},{X}_{s\wedge {\tau}_{n}}^{\stackrel{~}{\bm{\pi}}})+\lambda {\int}_{t}^{s\wedge {\tau}_{n}}{\int}_{{\mathbb{R}}^{d}}{\stackrel{~}{\pi}}_{v}(u)\ln {\stackrel{~}{\pi}}_{v}(u)dudv\,\Big|\,{X}_{t}^{\stackrel{~}{\bm{\pi}}}=x\Big],$$ 
for $(t,x)\in [0,T]\times \mathbb{R}$ and $s\in [t,T]$. Now taking $s=T$, and using that ${V}^{\bm{\pi}}(T,x)={V}^{\stackrel{~}{\bm{\pi}}}(T,x)={(x-w)}^{2}-{(w-z)}^{2}$ together with the assumption that $\stackrel{~}{\pi}$ is admissible, we obtain, by sending $n\to \mathrm{\infty}$ and applying the dominated convergence theorem, that
$${V}^{\bm{\pi}}(t,x)\ge \mathbb{E}\Big[{V}^{\stackrel{~}{\bm{\pi}}}(T,{X}_{T}^{\stackrel{~}{\bm{\pi}}})+\lambda {\int}_{t}^{T}{\int}_{{\mathbb{R}}^{d}}{\stackrel{~}{\pi}}_{v}(u)\ln {\stackrel{~}{\pi}}_{v}(u)dudv\,\Big|\,{X}_{t}^{\stackrel{~}{\bm{\pi}}}=x\Big]={V}^{\stackrel{~}{\bm{\pi}}}(t,x),$$ 
for any $(t,x)\in [0,T]\times \mathbb{R}$.
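The step from (33) to the Gaussian form of the improved policy can be made explicit. The following is a sketch of the standard calculation, assuming ${V}_{xx}^{\bm{\pi}}(t,x)>0$ so that the quadratic form is positive definite; the symbols $q$ and $m$ are shorthands introduced here. Writing $q(u):=\frac{1}{2}{u}^{\prime}{\sigma}^{\prime}\sigma u\,{V}_{xx}^{\bm{\pi}}+{\rho}^{\prime}\sigma u\,{V}_{x}^{\bm{\pi}}$, the functional $\widehat{\pi}\mapsto {\int}_{{\mathbb{R}}^{d}}\left(q(u)+\lambda \ln \widehat{\pi}(u)\right)\widehat{\pi}(u)du$ is minimized over densities by $\widehat{\pi}(u)\propto \exp (-q(u)/\lambda )$. Completing the square,

$$q(u)=\frac{1}{2}{V}_{xx}^{\bm{\pi}}{(u-m)}^{\prime}{\sigma}^{\prime}\sigma (u-m)-\frac{1}{2}{V}_{xx}^{\bm{\pi}}{m}^{\prime}{\sigma}^{\prime}\sigma m,\quad m:=-{\sigma}^{-1}\rho \frac{{V}_{x}^{\bm{\pi}}}{{V}_{xx}^{\bm{\pi}}},$$

so the minimizer is the Gaussian density

$$\widehat{\pi}(u)=\mathcal{N}\left(u\,\Big|\,-{\sigma}^{-1}\rho \frac{{V}_{x}^{\bm{\pi}}(t,x)}{{V}_{xx}^{\bm{\pi}}(t,x)},\ \frac{\lambda {({\sigma}^{\prime}\sigma )}^{-1}}{{V}_{xx}^{\bm{\pi}}(t,x)}\right).$$

The mean depends on the value function only through the ratio ${V}_{x}^{\bm{\pi}}/{V}_{xx}^{\bm{\pi}}$, while the covariance scales with the exploration weight $\lambda $.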
C.4 Proof of Theorem 4
Proof. It can be easily verified that the feedback policy ${\bm{\pi}}_{0}(u;t,x,w)=\mathcal{N}(u\,|\,\alpha (x-w),\mathrm{\Sigma}{e}^{\beta (T-t)})$ generates an open-loop policy ${\pi}_{0}$ that is admissible with respect to the initial $(t,x)$. Moreover, it follows from the Feynman-Kac formula that the corresponding value function ${V}^{{\bm{\pi}}_{0}}$ satisfies the PDE
$${V}_{t}^{{\bm{\pi}}_{0}}(t,x;w)+{\int}_{{\mathbb{R}}^{d}}\Big(\frac{1}{2}{u}^{\prime}{\sigma}^{\prime}\sigma u{V}_{xx}^{{\bm{\pi}}_{0}}(t,x;w)+{\rho}^{\prime}\sigma u{V}_{x}^{{\bm{\pi}}_{0}}(t,x;w)$$ 
$$+\lambda \ln {\bm{\pi}}_{0}(u;t,x,w)\Big){\bm{\pi}}_{0}(u;t,x,w)du=0,$$  (34) 
with terminal condition ${V}^{{\bm{\pi}}_{0}}(T,x;w)={(x-w)}^{2}-{(w-z)}^{2}$. Simplifying this equation we obtain
$${V}_{t}^{{\bm{\pi}}_{0}}(t,x;w)+\frac{1}{2}{V}_{xx}^{{\bm{\pi}}_{0}}(t,x;w)\text{Tr}\left(\sigma \alpha {\alpha}^{\prime}{\sigma}^{\prime}{(x-w)}^{2}+\sigma \mathrm{\Sigma}{\sigma}^{\prime}{e}^{\beta (T-t)}\right)$$ 
$$+{V}_{x}^{{\bm{\pi}}_{0}}(t,x;w){\rho}^{\prime}\sigma \alpha (x-w)-\frac{\lambda}{2}\left(d\ln (2\pi e)+\ln |\mathrm{\Sigma}|+d\beta (T-t)\right)=0,$$  (35) 
where $\text{Tr}(\cdot )$ denotes the trace of a square matrix. A classical solution to equation (35) is given by
$${V}^{{\bm{\pi}}_{0}}={(x-w)}^{2}{e}^{(2{\rho}^{\prime}\sigma \alpha +\text{Tr}(\sigma \alpha {\alpha}^{\prime}{\sigma}^{\prime}))(T-t)}+\frac{\text{Tr}(\sigma \mathrm{\Sigma}{\sigma}^{\prime}){e}^{(\beta +2{\rho}^{\prime}\sigma \alpha +\text{Tr}(\sigma \alpha {\alpha}^{\prime}{\sigma}^{\prime}))(T-t)}}{\beta +2{\rho}^{\prime}\sigma \alpha +\text{Tr}(\sigma \alpha {\alpha}^{\prime}{\sigma}^{\prime})}$$ 
$$-\frac{\lambda d}{4}\beta {t}^{2}+\frac{\lambda d}{2}\left(\ln \left(2\pi e{|\mathrm{\Sigma}|}^{\frac{1}{d}}\right)+\beta T\right)t-{(w-z)}^{2}-\frac{\text{Tr}(\sigma \mathrm{\Sigma}{\sigma}^{\prime})}{\beta +2{\rho}^{\prime}\sigma \alpha +\text{Tr}(\sigma \alpha {\alpha}^{\prime}{\sigma}^{\prime})},$$ 
if $\beta +2{\rho}^{\prime}\sigma \alpha +\text{Tr}(\sigma \alpha {\alpha}^{\prime}{\sigma}^{\prime})\ne 0$ and, by
$${V}^{{\bm{\pi}}_{0}}={(x-w)}^{2}{e}^{(2{\rho}^{\prime}\sigma \alpha +\text{Tr}(\sigma \alpha {\alpha}^{\prime}{\sigma}^{\prime}))(T-t)}-\frac{\lambda d}{4}\beta {t}^{2}+\left(\frac{\lambda d}{2}\left(\ln \left(2\pi e{|\mathrm{\Sigma}|}^{\frac{1}{d}}\right)+\beta T\right)-\text{Tr}(\sigma \mathrm{\Sigma}{\sigma}^{\prime})\right)t$$ 
$$-{(w-z)}^{2},$$ 
if $\beta +2{\rho}^{\prime}\sigma \alpha +\text{Tr}(\sigma \alpha {\alpha}^{\prime}{\sigma}^{\prime})=0$. In either case, it is easy to check that ${V}^{{\bm{\pi}}_{0}}$ satisfies the conditions in Theorem 3 and, hence, the theorem applies. The improved policy is given by (16), which, in the current case, becomes
$${\bm{\pi}}_{1}(u;t,x,w)=\mathcal{N}\left(u\,\Big|\,-{\sigma}^{-1}\rho (x-w),\frac{\lambda {({\sigma}^{\prime}\sigma )}^{-1}}{2{e}^{(2{\rho}^{\prime}\sigma \alpha +\text{Tr}(\sigma \alpha {\alpha}^{\prime}{\sigma}^{\prime}))(T-t)}}\right).$$ 
Again, we can calculate the corresponding value function as ${V}^{{\bm{\pi}}_{1}}(t,x;w)={(x-w)}^{2}{e}^{-{\rho}^{\prime}\rho (T-t)}+{F}_{1}(t)$, where ${F}_{1}$ is a function of $t$ only. Theorem 3 is applicable again, which yields the improved policy ${\bm{\pi}}_{2}$ as exactly the optimal Gaussian policy ${\bm{\pi}}^{\ast}$ given in (14), together with the optimal value function $V$ in (13). The desired convergence therefore follows since, for $n\ge 2$, neither the policy nor the value function strictly improves any further under the policy improvement scheme (16).
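The two-step convergence described above can be traced numerically in one dimension. The sketch below is an illustration under stated assumptions, not the EMV algorithm itself: `improve` is a hypothetical helper, the scalars `sigma`, `rho`, `lam`, `T`, `t` and the starting coefficient `alpha` are arbitrary, and the update uses the $d=1$ reduction of the formulas in this proof (exponent $a=2\rho \sigma \alpha +{\sigma}^{2}{\alpha}^{2}$, improved mean coefficient $-\rho /\sigma $, improved variance $\lambda /(2{\sigma}^{2}{e}^{a(T-t)})$).

```python
import math


def improve(alpha, sigma, rho, lam, T, t):
    """One policy-improvement step in one dimension.

    The current policy has mean alpha * (x - w); its value function has the form
    (x - w)^2 * exp(a * (T - t)) + F(t), with a = 2 rho sigma alpha + (sigma alpha)^2.
    Returns the improved mean coefficient, the improved variance at time t, and a.
    """
    a = 2.0 * rho * sigma * alpha + (sigma * alpha) ** 2
    new_alpha = -rho / sigma
    new_var = lam / (2.0 * sigma ** 2 * math.exp(a * (T - t)))
    return new_alpha, new_var, a


# Illustrative parameters only.
sigma, rho, lam, T, t = 0.2, 0.5, 1.0, 1.0, 0.25
alpha = 0.7                      # arbitrary initial mean coefficient
history = []
for _ in range(3):
    alpha, var, a = improve(alpha, sigma, rho, lam, T, t)
    history.append((alpha, var, a))

# Variance of the optimal Gaussian policy at time t in one dimension.
optimal_var = lam / (2.0 * sigma ** 2) * math.exp(rho ** 2 * (T - t))
for step, (al, v, ex) in enumerate(history, start=1):
    print(step, al, v, ex)
print(optimal_var)
```

After the first step the mean coefficient is already $-\rho /\sigma $; after the second, the exponent equals $-{\rho}^{2}$ and the variance matches the optimal one, and a third step changes nothing.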
Appendix D Empirical Results: the Batch Method
In Section 5.2, we provided the experimental results for monthly trading under universal training and testing. Another way to train and test the EMV and DDPG algorithms is based on batch (offline) RL, as described in the main text. The batch method uses training and testing data from the same set/seed of $d=20$ stocks for each experiment, and the investment performance of the EMV algorithm over the $100$ seeds is reported in Figure 1(a). Due to the extensive training time (see Table 1), we only train and test DDPG under the batch method for $8$ seeds.
The batch method demonstrates qualitatively similar behavior to the universal training and testing method (see Figure 0(a)) when compared to the econometric methods and the deep RL method. A more detailed comparison between the two methods is shown in Figure 1(b). It is interesting to notice that, while the two methods perform equally well for most of the testing period $2000$-$2010$, the universal method is less affected by the $2008$ financial crisis, with less variability and higher returns. The batch method, which does not take into account the data of other stocks for each portfolio/seed during training and testing, is more susceptible to stock market plunges. Nonetheless, both methods are data efficient, especially considering that, for example, the training set for the batch method contains the same number of data points as there are decision-making points in the testing set ($120\times 20$).