Abstract
Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without interacting with the environment. A common challenge is handling multi-modal action distributions, where multiple behaviours are represented in the data. Existing methods often assume unimodal behaviour policies, leading to suboptimal performance when this assumption is violated. We propose weighted imitation Learning on One Mode (LOM), a novel approach that focuses on learning from a single, promising mode of the behaviour policy. By using a Gaussian mixture model to identify modes and selecting the best mode based on expected returns, LOM avoids the pitfalls of averaging over conflicting actions. Theoretically, we show that LOM improves performance while maintaining simplicity in policy learning. Empirically, LOM outperforms existing methods on standard D4RL benchmarks and demonstrates its effectiveness in complex, multi-modal scenarios.
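The idea of selecting one mode rather than averaging over conflicting actions can be illustrated with a toy sketch. This is not the LOM implementation: it assumes a single state, one-dimensional actions, a two-component mixture fitted with plain EM, and a per-sample return that is already known. It also omits the weighted imitation step entirely. All names (`fit_gmm_1d`, `mode_returns`, `best_mode_mean`) and the synthetic data are hypothetical.

```python
import math
import random


def fit_gmm_1d(xs, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM (toy version)."""
    means = [min(xs), max(xs)]  # crude initialisation at the data extremes
    stds = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                for w, m, s in zip(weights, means, stds)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            stds[k] = max(math.sqrt(var), 1e-3)  # floor to avoid collapse
    # resp holds the responsibilities from the final E-step.
    return means, resp


# Synthetic bimodal behaviour data (an assumption for illustration only):
# actions near -1 earn low returns, actions near +1 earn high returns.
rng = random.Random(0)
actions, returns = [], []
for _ in range(50):
    actions.append(rng.gauss(-1.0, 0.1))
    returns.append(rng.gauss(0.2, 0.05))
for _ in range(50):
    actions.append(rng.gauss(1.0, 0.1))
    returns.append(rng.gauss(1.0, 0.05))

means, resp = fit_gmm_1d(actions)

# Expected return of each mode: responsibility-weighted average of returns.
mode_returns = [
    sum(r[k] * g for r, g in zip(resp, returns)) / sum(r[k] for r in resp)
    for k in range(2)
]
best = max(range(2), key=lambda k: mode_returns[k])
best_mode_mean = means[best]

# A unimodal fit collapses to the average action, which lies between the modes.
mean_action = sum(actions) / len(actions)

print("average action:", round(mean_action, 3))
print("selected mode mean:", round(best_mode_mean, 3))
```

In this toy setting the average action falls between the two modes, in a region the dataset never visits, whereas the selected mode sits on the higher-return behaviour.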