Abstract
In Multi-objective Reinforcement Learning (MORL), agents are tasked with optimising decision-making behaviours that trade off between multiple, possibly conflicting, objectives. MORL based on decomposition is a family of solution methods that employ a number of utility functions to decompose the multi-objective problem into individual single-objective problems, which are solved simultaneously in order to approximate a Pareto front of policies. We focus on the case of linear utility functions parameterised by weight vectors w. We introduce a method based on the Upper Confidence Bound (UCB) to efficiently search for the most promising weight vectors during different stages of the learning process, with the aim of maximising the hypervolume of the resulting Pareto front. The proposed method is shown to outperform various MORL baselines on MuJoCo benchmark problems across different random seeds. The code is online at: https://github.com/SYCAMORE-1/ucb-MOPPO.
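
To make the two ingredients named above concrete, the following is a minimal sketch of linear scalarisation and of a UCB-style choice among candidate weight vectors. It is not the authors' implementation: it uses the textbook UCB1 rule, and the names `linear_utility`, `ucb_select`, `run_bandit`, the exploration constant `c`, and the callback `gain_fn` are all assumptions introduced for illustration. In particular, `gain_fn` is a hypothetical stand-in for whatever signal the learner observes after training under a weight vector (for example, a hypervolume improvement).

```python
import math


def linear_utility(returns, w):
    """Scalarise a vector of per-objective returns with weight vector w."""
    return sum(r_i * w_i for r_i, w_i in zip(returns, w))


def ucb_select(mean_gain, counts, t, c=1.0):
    """Return the index of the candidate with the highest UCB1 score.

    mean_gain[i] is the running mean of the gain observed for candidate i,
    counts[i] is how often it has been selected, and t is the current round.
    """
    # Select every candidate once before relying on the estimates.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [
        m + c * math.sqrt(2.0 * math.log(t) / n)
        for m, n in zip(mean_gain, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(weights, gain_fn, rounds, c=1.0):
    """Repeatedly pick a weight vector by UCB1 and update its mean gain.

    gain_fn(w) is a hypothetical callback returning a scalar gain for w.
    """
    counts = [0] * len(weights)
    mean_gain = [0.0] * len(weights)
    for t in range(1, rounds + 1):
        i = ucb_select(mean_gain, counts, t, c)
        g = gain_fn(weights[i])
        counts[i] += 1
        # Incremental update of the running mean.
        mean_gain[i] += (g - mean_gain[i]) / counts[i]
    return counts, mean_gain


if __name__ == "__main__":
    # Toy candidates over two objectives; the gain function is made up.
    candidates = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
    counts, means = run_bandit(candidates, lambda w: w[0], rounds=200)
    print(counts, means)
```

In this toy setting the selection counts concentrate on the candidate with the largest observed gain, while the exploration bonus keeps the other candidates from being abandoned entirely.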