Abstract
Multi-objective reinforcement learning (MORL) approaches have emerged to tackle many real-world problems with multiple conflicting objectives by maximizing a joint objective function weighted by a preference vector. These approaches find fixed customized policies corresponding to preference vectors specified during training. However, the design constraints and objectives typically change dynamically in real-life scenarios. Furthermore, storing a policy for each potential preference is not scalable. Hence, obtaining a set of Pareto front solutions for the entire preference space in a given domain with a single training run is critical. To this end, we propose a novel MORL algorithm that trains a single universal network to cover the entire preference space and scales to continuous robotic tasks. The proposed approach, Preference-Driven MORL (PD-MORL), utilizes the preferences as guidance to update the network parameters. It also employs a novel parallelization approach to increase sample efficiency. We show that PD-MORL achieves up to 25% larger hypervolume for challenging continuous control tasks and uses an order of magnitude fewer trainable parameters compared to prior approaches.