Abstract
We propose to solve large-scale Markowitz mean-variance (MV) portfolio allocation problems using reinforcement learning (RL). By adopting the recently developed continuous-time exploratory control framework, we formulate the exploratory MV problem in high dimensions. We further show the optimality of a multivariate Gaussian feedback policy, with time-decaying variance, in trading off exploration and exploitation. Based on a provable policy improvement theorem, we devise a scalable and data-efficient RL algorithm and conduct large-scale empirical tests using data from the S&P 500 stocks. We find that our method consistently achieves over 10% annualized returns and outperforms econometric methods and the deep RL method by large margins, for both long- and medium-term investment horizons with monthly and daily trading.