A Dynamic Penalty Function Approach for Constraints-Handling in Reinforcement Learning

Abstract

Reinforcement learning (RL) is attracting attention as an effective way to solve sequential optimization problems that involve high-dimensional state/action spaces and stochastic uncertainties. Many such problems involve constraints expressed as inequalities. This study focuses on using RL to solve constrained optimal control problems. Most RL application studies have dealt with inequality constraints by adding soft penalty terms for violating the constraints to the reward function. However, while training neural networks to learn the value (or Q) function, one can run into computational issues caused by the sharp change in the function value at the constraint boundary due to the large penalty imposed. This difficulty during training can lead to convergence problems and ultimately to poor closed-loop performance. To address this issue, this study proposes a dynamic penalty (DP) approach where the penalty factor is gradually and systematically increased during training as the iteration episodes proceed. We first examine the ability of a neural network to represent a value function when uniform, linear, or DP functions are added to prevent constraint violation. The agent trained by a Deep Q Network (DQN) algorithm with the DP function approach was compared with agents trained with other, constant penalty functions in a simple vehicle control problem. Results show that the proposed approach can improve the neural network approximation accuracy and provide faster convergence when close to a solution.
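
To make the dynamic penalty idea concrete, the following is a minimal Python sketch of a penalty factor that grows with the training episode and is applied as a soft penalty on the reward. The abstract does not state the schedule or the penalty form, so the linear ramp, the function names `penalty_factor` and `penalized_reward`, and the default values `start=0.0` and `final=100.0` are all illustrative assumptions rather than the authors' method.

```python
def penalty_factor(episode: int, total_episodes: int,
                   start: float = 0.0, final: float = 100.0) -> float:
    """Hypothetical schedule: ramp the penalty factor linearly over training.

    The linear form and the default values are assumptions for illustration;
    the abstract only says the factor is increased gradually during training.
    """
    if total_episodes <= 1:
        return final
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(episode / (total_episodes - 1), 0.0), 1.0)
    return start + frac * (final - start)


def penalized_reward(reward: float, violation: float, factor: float) -> float:
    """Subtract a soft penalty proportional to the constraint violation.

    `violation` is assumed to be g(s, a) for an inequality constraint
    g(s, a) <= 0, so only positive values are penalized.
    """
    return reward - factor * max(violation, 0.0)


if __name__ == "__main__":
    total = 11
    for episode in (0, 5, 10):
        factor = penalty_factor(episode, total)
        # Same reward and violation, penalized more heavily as training proceeds.
        print(episode, factor, penalized_reward(1.0, 0.5, factor))
```

Early in training the penalty is small, so the target value function changes only mildly across the constraint boundary; the sharp change appears only once the factor has grown. A constant-penalty baseline corresponds to calling `penalized_reward` with a fixed `factor` in every episode.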
