Abstract
Reinforcement learning algorithms are typically designed for discrete-time dynamics, even though the underlying real-world control systems are often continuous in time. In this paper, we study the problem of continuous-time reinforcement learning, where the unknown system dynamics are represented using nonlinear ordinary differential equations (ODEs). We leverage probabilistic models, such as Gaussian processes and Bayesian neural networks, to learn an uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily maximizes a weighted sum of the extrinsic reward and model epistemic uncertainty. This yields a scalable and sample-efficient approach to continuous-time model-based RL. We show that COMBRL achieves sublinear regret in the reward-driven setting, and in the unsupervised RL setting (i.e., without extrinsic rewards), we provide a sample complexity bound. In our experiments, we evaluate COMBRL in both standard and unsupervised RL settings and demonstrate that it scales better, is more sample-efficient than prior methods, and outperforms baselines across several deep RL tasks.
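To illustrate the objective described above, the following is a minimal sketch, not the COMBRL implementation. It assumes a scalar state, a forward-Euler discretization of the learned mean ODE, and a finite set of candidate policies. All names (`optimistic_return`, `greedy_policy`, `lam`, and the toy model functions) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence

Policy = Callable[[float], float]


def optimistic_return(policy: Policy,
                      mean_dynamics: Callable[[float, float], float],
                      epistemic_std: Callable[[float, float], float],
                      reward: Callable[[float, float], float],
                      x0: float, horizon: float, dt: float, lam: float) -> float:
    """Euler estimate of the integral of reward + lam * epistemic uncertainty
    along a rollout of the learned mean dynamics dx/dt = mean_dynamics(x, u)."""
    x, total = x0, 0.0
    for _ in range(int(round(horizon / dt))):
        u = policy(x)
        total += (reward(x, u) + lam * epistemic_std(x, u)) * dt
        x = x + mean_dynamics(x, u) * dt  # forward-Euler step (assumption)
    return total


def greedy_policy(candidates: Sequence[Policy], **kwargs) -> Policy:
    """Pick the candidate that maximizes the weighted objective."""
    return max(candidates, key=lambda pi: optimistic_return(pi, **kwargs))


# Toy stand-ins for a learned model (all hypothetical).
def stay(x: float) -> float:      # hold position
    return 0.0

def explore(x: float) -> float:   # drift away from the origin
    return 1.0

toy = dict(
    mean_dynamics=lambda x, u: u,       # dx/dt = u
    epistemic_std=lambda x, u: abs(x),  # uncertainty grows away from the origin
    reward=lambda x, u: -x * x,         # extrinsic reward favors small |x|
    x0=1.0, horizon=1.0, dt=0.01,
)

def pick(lam: float) -> Policy:
    return greedy_policy([stay, explore], lam=lam, **toy)
```

In this toy setup, `lam = 0` recovers purely reward-driven selection, while a sufficiently large `lam` favors the candidate that visits states where the model is more uncertain. Setting the reward to zero would correspond to the unsupervised setting mentioned above.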