Abstract
We study reinforcement learning (RL) for a class of continuous-time linear-quadratic (LQ) control problems for diffusions, where states are scalar-valued and running control rewards are absent but volatilities of the state processes depend on both state and control variables. We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimations, and devise an RL algorithm to learn the optimal policy parameter directly. Our main contributions include the introduction of an exploration schedule and a regret analysis of the proposed algorithm. We provide the convergence rate of the policy parameter to the optimal one, and prove that the algorithm achieves a regret bound of $O(N^{\frac{3}{4}})$ up to a logarithmic factor, where $N$ is the number of learning episodes. We conduct a simulation study to validate the theoretical results and demonstrate the effectiveness and reliability of the proposed algorithm. We also perform numerical comparisons between our method and those of the recent model-based stochastic LQ RL studies adapted to the state- and control-dependent volatility setting, demonstrating a better performance of the former in terms of regret bounds.