Abstract
In Multi-objective Reinforcement Learning (MORL), agents are tasked with optimising decision-making behaviours that trade off between multiple, possibly conflicting, objectives. MORL based on decomposition is a family of solution methods that employ a number of utility functions to decompose the multi-objective problem into individual single-objective problems, which are solved simultaneously in order to approximate a Pareto front of policies. We focus on the case of linear utility functions parametrised by weight vectors w. We introduce a method based on the Upper Confidence Bound (UCB) to efficiently search for the most promising weight vectors during different stages of the learning process, with the aim of maximising the hypervolume of the resulting Pareto front. The proposed method shows consistent and strong performance against various MORL baselines on MuJoCo benchmark problems. The code is released at: https://github.com/SYCAMORE-1/ucb-MOPPO
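The abstract does not spell out the exact selection rule, so the following is only a generic illustration of the ingredients it names, not the released ucb-MOPPO code. It is a minimal sketch under stated assumptions. A linear utility scalarises a vector reward r as w · r. Each candidate weight vector, drawn from a fixed finite set (an assumption), is treated as a bandit arm and chosen by the standard UCB1 rule. The arm's reward is assumed to be something like the hypervolume gain observed after training with that weight. The names `linear_utility`, `hypervolume_2d` and `UCBWeightSelector` are hypothetical.

```python
import math
import numpy as np


def linear_utility(rewards, w):
    """Linear scalarisation of a vector reward: u(r) = w . r."""
    return float(np.dot(w, rewards))


def hypervolume_2d(points, ref):
    """Area dominated by 2-objective points (maximisation) relative to ref."""
    # Keep only points that strictly dominate the reference point,
    # then sweep from the largest first objective downwards.
    pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                 key=lambda p: -p[0])
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:  # point adds a new horizontal strip to the staircase
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


class UCBWeightSelector:
    """UCB1 over a fixed, finite set of candidate weight vectors (assumed)."""

    def __init__(self, weights, c=math.sqrt(2)):
        self.weights = [np.asarray(w, dtype=float) for w in weights]
        self.c = c  # exploration constant; sqrt(2) gives classic UCB1
        self.counts = np.zeros(len(self.weights))
        self.means = np.zeros(len(self.weights))
        self.t = 0

    def select(self):
        """Return the index of the weight vector to train with next."""
        self.t += 1
        # Try every arm once before relying on confidence bounds.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        bonus = self.c * np.sqrt(math.log(self.t) / self.counts)
        return int(np.argmax(self.means + bonus))

    def update(self, i, reward):
        """Incrementally update the mean reward of arm i.

        Assumption: `reward` is e.g. the hypervolume improvement measured
        after a training stage that used weight vector i.
        """
        self.counts[i] += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]
```

In use, each learning stage would call `select()`, train the policy on the utility `linear_utility(r, weights[i])`, measure the change in `hypervolume_2d` of the current front, and feed that change back through `update()`. The exploration bonus shrinks as an arm is sampled more, so weights that keep improving the front are revisited while rarely tried weights still get occasional attention.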