Abstract
We study continuous-time reinforcement learning (RL) for stochastic control in which system dynamics are governed by jump-diffusion processes. We formulate an entropy-regularized exploratory control problem with stochastic policies to capture the exploration--exploitation balance essential for RL. Unlike the pure diffusion case initially studied by Wang et al. (2020), the derivation of the exploratory dynamics under jump-diffusions calls for a careful formulation of the jump part. Through a theoretical analysis, we find that one can simply use the same policy evaluation and $q$-learning algorithms in Jia and Zhou (2022a, 2023), originally developed for controlled diffusions, without needing to check a priori whether the underlying data come from a pure diffusion or a jump-diffusion. However, we show that the presence of jumps ought to affect parameterizations of actors and critics in general. We investigate as an application the mean--variance portfolio selection problem with stock price modelled as a jump-diffusion, and show that both RL algorithms and parameterizations are invariant with respect to jumps. Finally, we present a detailed study on applying the general theory to option hedging.