Improving Generalization of Reinforcement Learning with Minimax Distributional Soft Actor-Critic

Abstract

Reinforcement learning (RL) has achieved remarkable performance in a variety of sequential decision-making and control tasks. However, a common problem is that the learned near-optimal policy tends to overfit to the training environment and may not generalize to situations never encountered during training. In practical applications, the randomness of the environment usually leads to rare but devastating events, which should be the focus of safety-critical systems such as autonomous driving. In this paper, we introduce the minimax formulation and the distributional framework to improve the generalization ability of RL algorithms and develop the Minimax Distributional Soft Actor-Critic (Minimax DSAC) algorithm. The minimax formulation seeks an optimal policy under the most severe disturbances from the environment: the protagonist policy maximizes the action-value function while the adversary policy tries to minimize it. The distributional framework learns a state-action return distribution, from which we can model the risk of different returns explicitly and thereby formulate a risk-averse protagonist policy and a risk-seeking adversarial policy. We implement our method on decision-making tasks for autonomous vehicles at intersections and test the trained policy in environments distinct from the training environment. Results demonstrate that our method can greatly improve the generalization ability of the protagonist agent to different environmental variations.
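
To make the interplay between the minimax formulation and the return distribution concrete, the following is a minimal toy sketch, not the Minimax DSAC algorithm itself. It assumes a single state, a small discrete table of hypothetical (mean, standard deviation) return estimates, and a mean-minus-k-times-std risk score; the action names, the numbers, and the coefficients k_pro and k_adv are all illustrative assumptions and do not come from the paper.

```python
from typing import Dict, Tuple

# Hypothetical return distributions for one state, summarized as (mean, std)
# for each (protagonist action, adversary disturbance) pair.
# All names and numbers are made up for illustration.
RETURNS: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("go", "calm"): (10.0, 1.0),
    ("go", "cut_in"): (2.0, 6.0),
    ("yield", "calm"): (6.0, 0.5),
    ("yield", "cut_in"): (5.0, 1.0),
}


def risk_score(mean: float, std: float, k: float) -> float:
    """Risk-adjusted value: penalize the mean return by k standard deviations."""
    return mean - k * std


def minimax_action(
    returns: Dict[Tuple[str, str], Tuple[float, float]],
    k_pro: float,
    k_adv: float,
) -> Tuple[str, str, float]:
    """Pick the protagonist action with the best worst-case risk-adjusted value.

    Assumption: the adversary answers each protagonist action with the
    disturbance that minimizes mean - k_adv * std, so it favors
    low-mean, high-variance outcomes. The protagonist then maximizes
    mean - k_pro * std against that response.
    """
    actions = sorted({a for a, _ in returns})
    disturbances = sorted({u for _, u in returns})

    best = None
    for a in actions:
        # Adversary's response to protagonist action a.
        u_star = min(disturbances, key=lambda u: risk_score(*returns[(a, u)], k_adv))
        # Protagonist's risk-averse value under that disturbance.
        value = risk_score(*returns[(a, u_star)], k_pro)
        if best is None or value > best[2]:
            best = (a, u_star, value)
    return best


if __name__ == "__main__":
    print(minimax_action(RETURNS, k_pro=1.0, k_adv=1.0))
```

In this toy table, a policy that only considers the undisturbed "calm" case would prefer "go" (mean return 10), whereas the minimax choice is "yield", because the adversary can answer "go" with a disturbance that yields a low and highly uncertain return.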
