On Generalization and Distributional Update for Mimicking Observations with Adequate Exploration

  • 2025-10-21 14:03:30
  • Yirui Zhou, Yunfei Jin, Xiaowei Liu, Xiaofeng Zhang, Yangchun Zhang

Abstract

Learning from observations (LfO) replicates expert behavior without needing access to the expert's actions, making it more practical than learning from demonstrations (LfD) in many real-world scenarios. However, directly applying the on-policy training scheme in LfO worsens the sample inefficiency problem, while employing the traditional off-policy training scheme in LfO magnifies the instability issue. This paper seeks to develop an efficient and stable solution for the LfO problem. Specifically, we begin by exploring the generalization capabilities of both the reward function and policy in LfO, which provides a theoretical foundation for computation. Building on this, we modify the policy optimization method in generative adversarial imitation from observation (GAIfO) with distributional soft actor-critic (DSAC), and propose the Mimicking Observations through Distributional Update Learning with adequate Exploration (MODULE) algorithm to solve the LfO problem. MODULE incorporates the advantages of (1) high sample efficiency and training robustness enhancement in soft actor-critic (SAC), and (2) training stability in distributional reinforcement learning (RL). Extensive experiments in MuJoCo environments showcase the superior performance of MODULE over current LfO methods.
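
To make the LfO setting concrete, the sketch below shows how an adversarial method in the GAIfO family can score state transitions (s, s') without ever seeing expert actions. It is a minimal illustration under stated assumptions, not the paper's implementation: the linear discriminator, the names `transition_features`, `discriminator_logit`, and `lfo_reward`, and the reward form -log(1 - D) with D read as the probability that a transition came from the expert are all hypothetical choices made for this example. The abstract does not specify MODULE's discriminator or reward.

```python
import numpy as np


def transition_features(s, s_next):
    # LfO uses only observations: the discriminator input is the pair (s, s'),
    # with no action component.
    return np.concatenate([s, s_next], axis=-1)


def discriminator_logit(w, b, s, s_next):
    # Hypothetical linear discriminator; a neural network would normally
    # take its place.
    return transition_features(s, s_next) @ w + b


def lfo_reward(w, b, s, s_next):
    # Assumed convention: D = sigmoid(logit) is the probability that the
    # transition is from the expert, and the reward is -log(1 - D).
    # Since -log(1 - sigmoid(x)) = log(1 + exp(x)), this is softplus(logit),
    # computed stably with logaddexp.
    logit = discriminator_logit(w, b, s, s_next)
    return np.logaddexp(0.0, logit)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s = rng.normal(size=(4, 3))       # batch of 4 states, 3 dimensions each
    s_next = rng.normal(size=(4, 3))  # corresponding next states
    w = rng.normal(size=6)            # weights over concatenated (s, s')
    print(lfo_reward(w, 0.0, s, s_next))
```

A reward of this kind would then be handed to the policy optimizer, which in the paper's case is built on DSAC in place of GAIfO's original on-policy update.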

 
