Efficient Reinforcement Learning via Decoupling Exploration and Utilization

Abstract

Reinforcement Learning (RL), recognized as an efficient learning approach, has achieved remarkable success across multiple fields and applications, including gaming, robotics, and autonomous vehicles. Classical single-agent reinforcement learning grapples with the imbalance between exploration and exploitation as well as limited generalization ability. As a result, algorithms frequently settle for suboptimal solutions that are tailored only to specific datasets. In this work, our aim is to train agents that learn efficiently by decoupling exploration and utilization, so that the agent can escape the conundrum of suboptimal solutions. In reinforcement learning, previously imposed pessimistic penalties have deprived the model of its exploratory potential, resulting in diminished exploration capabilities. To address this, we introduce an additional optimistic Actor to enhance the model's exploration ability, while employing a more constrained pessimistic Actor for performance evaluation. This idea is implemented in the proposed OPARL (Optimistic and Pessimistic Actor Reinforcement Learning) algorithm. This combination within the reinforcement learning paradigm fosters a more balanced and efficient approach: it optimizes policies that concentrate on high-reward actions via pessimistic exploitation, while concurrently ensuring extensive state coverage through optimistic exploration. Empirical and theoretical investigations demonstrate that OPARL enhances agent capabilities in both utilization and exploration. On most tasks in the DMControl benchmark and MuJoCo environments, OPARL performs better than state-of-the-art methods. Our code is released at https://github.com/yydsok/OPARL
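
The decoupling described above can be illustrated with a minimal sketch. This is not the OPARL implementation: it assumes a small ensemble of critic estimates over a discrete set of candidate actions, and it uses the maximum and minimum over that ensemble as simple stand-ins for optimistic and pessimistic value estimates. All function names here are hypothetical.

```python
from typing import List


def optimistic_scores(q: List[List[float]]) -> List[float]:
    """Per-action upper estimate: max over the critic ensemble (assumed stand-in)."""
    return [max(column) for column in zip(*q)]


def pessimistic_scores(q: List[List[float]]) -> List[float]:
    """Per-action lower estimate: min over the critic ensemble (assumed stand-in)."""
    return [min(column) for column in zip(*q)]


def select_action(q: List[List[float]], explore: bool) -> int:
    """Return the index of the best action under the chosen estimate.

    explore=True  -> act on the optimistic estimate (data collection).
    explore=False -> act on the pessimistic estimate (evaluation).
    """
    scores = optimistic_scores(q) if explore else pessimistic_scores(q)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: 2 critics (rows) scoring 3 candidate actions (columns).
q_values = [
    [1.0, 0.5, 0.2],
    [0.1, 0.6, 2.0],
]
explore_action = select_action(q_values, explore=True)
exploit_action = select_action(q_values, explore=False)
```

In this toy case the two estimates disagree: the optimistic view favors the action on which the critics differ most, while the pessimistic view favors the action whose worst-case estimate is highest. That divergence is the intuition behind using separate actors for exploration and evaluation.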
