Exploration versus exploitation in reinforcement learning: a stochastic control approach

Abstract

We consider reinforcement learning (RL) in continuous time and study the problem of achieving the best trade-off between exploration of a black box environment and exploitation of current knowledge. We propose an entropy-regularized reward function involving the differential entropy of the distributions of actions, and motivate and devise an exploratory formulation for the feature dynamics that captures repetitive learning under exploration. The resulting optimization problem is a resurrection of the classical relaxed stochastic control. We carry out a complete analysis of the problem in the linear-quadratic (LQ) case and deduce that the optimal control distribution for balancing exploitation and exploration is Gaussian. This in turn interprets and justifies the widely adopted Gaussian exploration in RL, beyond its simplicity for sampling. Moreover, the exploitation and exploration are reflected respectively by the mean and variance of the Gaussian distribution. We also find that a more random environment contains more learning opportunities in the sense that less exploration is needed, other things being equal. As the weight of exploration decays to zero, we prove the convergence of the solution to the entropy-regularized LQ problem to that of the classical LQ problem. Finally, we characterize the cost of exploration, which is shown to be proportional to the entropy regularization weight and inversely proportional to the discount rate in the LQ case.
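
As a rough illustration of the entropy-regularized reward described above, the following minimal sketch evaluates the differential entropy of a one-dimensional Gaussian action distribution and adds it, scaled by an exploration weight, to a running reward. The names `gaussian_entropy`, `regularized_reward`, `sample_action`, `running_reward`, and `lam` are hypothetical and chosen only for this example; the mean and variance are free inputs here, not the optimal values derived in the paper.

```python
import math
import random


def gaussian_entropy(variance):
    """Differential entropy (in nats) of a 1-D Gaussian with the given variance."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)


def regularized_reward(running_reward, variance, lam):
    """Running reward plus lam times the entropy of the action distribution.

    `lam` is an assumed name for the entropy regularization weight.
    """
    return running_reward + lam * gaussian_entropy(variance)


def sample_action(mean, variance, rng):
    """Draw one action: the mean carries exploitation, the variance exploration."""
    return rng.gauss(mean, math.sqrt(variance))


if __name__ == "__main__":
    rng = random.Random(0)
    # Larger variance means more exploration and a larger entropy bonus.
    for var in (0.1, 1.0, 10.0):
        action = sample_action(mean=0.5, variance=var, rng=rng)
        print(var, gaussian_entropy(var), regularized_reward(-1.0, var, lam=0.2), action)
```

With `lam` set to zero the entropy bonus vanishes and only the running reward remains, which loosely mirrors the limit in which the weight of exploration decays to zero.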
