MODULI: Unlocking Preference Generalization via Diffusion Models for Offline Multi-Objective Reinforcement Learning

Abstract

Multi-objective Reinforcement Learning (MORL) seeks to develop policies that simultaneously optimize multiple conflicting objectives, but it requires extensive online interactions. Offline MORL provides a promising solution by training on pre-collected datasets to generalize to any preference upon deployment. However, real-world offline datasets are often conservatively and narrowly distributed and fail to cover the full range of preferences, which leaves out-of-distribution (OOD) preference areas. Existing offline MORL algorithms generalize poorly to OOD preferences, resulting in policies that do not align with the given preferences. Leveraging the excellent expressive and generalization capabilities of diffusion models, we propose MODULI (Multi-objective Diffusion Planner with Sliding Guidance), which employs a preference-conditioned diffusion model as a planner to generate trajectories that align with various preferences and to derive actions for decision-making. To achieve accurate generation, MODULI introduces two return normalization methods under diverse preferences for refining guidance. To further enhance generalization to OOD preferences, MODULI proposes a novel sliding guidance mechanism, which involves training an additional slider adapter to capture the direction of preference changes. With the slider incorporated, the model transitions from in-distribution (ID) preferences to generating OOD preferences, patching and extending the incomplete Pareto front. Extensive experiments on the D4MORL benchmark demonstrate that our algorithm outperforms state-of-the-art offline MORL baselines and generalizes well to OOD preferences.
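To make the notion of an OOD preference area concrete, the following is a minimal illustrative sketch and not part of MODULI itself. It assumes preferences are weight vectors on the simplex and that returns are combined by linear scalarization, a common MORL convention that the abstract does not state explicitly. The helper names `scalarize` and `is_ood`, the tolerance `tol`, and the toy dataset are all hypothetical.

```python
def scalarize(returns, preference):
    """Linear scalarization (an assumed convention): weighted sum of per-objective returns."""
    return sum(w * r for w, r in zip(preference, returns))


def is_ood(preference, dataset_preferences, tol=0.05):
    """Flag a preference as OOD when no dataset preference lies within `tol` in L1 distance.

    This nearest-neighbor test is only an illustrative proxy for dataset coverage.
    """
    nearest = min(
        sum(abs(a - b) for a, b in zip(preference, p))
        for p in dataset_preferences
    )
    return nearest > tol


# Hypothetical two-objective dataset: preferences cover only the two ends
# of the simplex (first weight in [0, 0.3] or [0.7, 1.0]), leaving a gap in the middle.
covered = list(range(0, 31)) + list(range(70, 101))
dataset_preferences = [(w / 100, 1 - w / 100) for w in covered]

in_dist = (0.2, 0.8)    # falls inside the covered region
out_dist = (0.5, 0.5)   # falls in the uncovered gap

print(is_ood(in_dist, dataset_preferences))   # False
print(is_ood(out_dist, dataset_preferences))  # True
print(scalarize((10.0, 4.0), out_dist))       # 7.0
```

In this toy setting, a policy trained only on the covered ends would never have seen a balanced preference such as (0.5, 0.5), which is the kind of gap in the Pareto front that the sliding guidance mechanism is described as addressing.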
