Abstract
Learning from observations (LfO) replicates expert behavior without needing access to the expert's actions, making it more practical than learning from demonstrations (LfD) in many real-world scenarios. However, directly applying the on-policy training scheme in LfO worsens the sample inefficiency problem, while employing the traditional off-policy training scheme in LfO magnifies the instability issue. This paper seeks to develop an efficient and stable solution for the LfO problem. Specifically, we begin by exploring the generalization capabilities of both the reward function and the policy in LfO, which provides a theoretical foundation for computation. Building on this, we modify the policy optimization method in generative adversarial imitation from observation (GAIfO) using distributional soft actor-critic (DSAC), and propose the Mimicking Observations through Distributional Update Learning with adequate Exploration (MODULE) algorithm to solve the LfO problem. MODULE combines the advantages of (1) the high sample efficiency and improved training robustness of soft actor-critic (SAC), and (2) the training stability of distributional reinforcement learning (RL). Extensive experiments in MuJoCo environments showcase the superior performance of MODULE over current LfO methods.