Abstract
Reinforcement learning with multiple, potentially conflicting objectives is pervasive in real-world applications, yet the problem remains theoretically under-explored. This paper tackles the multi-objective reinforcement learning (MORL) problem and introduces an innovative actor-critic algorithm named MOAC, which finds a policy by iteratively making trade-offs among conflicting reward signals. Notably, we provide the first analysis of finite-time Pareto-stationary convergence and the corresponding sample complexity in both discounted and average reward settings. Our approach has two salient features: (a) MOAC mitigates the cumulative estimation bias resulting from finding an optimal common gradient descent direction out of stochastic samples. This enables provable convergence rate and sample complexity guarantees independent of the number of objectives; (b) with a proper momentum coefficient, MOAC initializes the weights of individual policy gradients using samples from the environment, instead of manual initialization. This enhances the practicality and robustness of our algorithm. Finally, experiments conducted on a real-world dataset validate the effectiveness of our proposed method.
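
To make the notions of a common descent direction and momentum-weighted gradients concrete, the following is a minimal sketch, assuming only two objectives and exact (noise-free) gradients. It is not the MOAC algorithm itself: the function names `min_norm_weights`, `momentum_update`, and `common_direction` are hypothetical, the weights come from the standard two-gradient min-norm closed form, and the convex-combination momentum rule with coefficient `eta` is an assumed form chosen for illustration.

```python
def min_norm_weights(g1, g2):
    """Simplex weights (w1, w2) minimizing ||w1*g1 + w2*g2||^2 (two-objective case)."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        # Identical gradients: every weighting gives the same direction.
        return 0.5, 0.5
    # Unconstrained minimizer over w1 of ||g2 + w1*(g1 - g2)||^2, clipped to [0, 1].
    w1 = sum((b - a) * b for a, b in zip(g1, g2)) / denom
    w1 = min(1.0, max(0.0, w1))
    return w1, 1.0 - w1


def momentum_update(prev_weights, new_weights, eta):
    """Assumed momentum rule: convex combination of old and freshly estimated weights."""
    return tuple((1.0 - eta) * p + eta * n for p, n in zip(prev_weights, new_weights))


def common_direction(weights, grads):
    """Weighted sum of the per-objective gradients."""
    dim = len(grads[0])
    return [sum(w * g[k] for w, g in zip(weights, grads)) for k in range(dim)]


if __name__ == "__main__":
    g1, g2 = [1.0, 0.0], [0.0, 1.0]
    fresh = min_norm_weights(g1, g2)
    # With eta = 1 the previous weights are ignored, so no manual initialization matters.
    weights = momentum_update((0.9, 0.1), fresh, eta=1.0)
    print(weights, common_direction(weights, [g1, g2]))
```

In this sketch, a point is Pareto-stationary when the min-norm combination of gradients is the zero vector, as happens for directly opposing gradients such as `[1.0, 0.0]` and `[-1.0, 0.0]`. Setting `eta` to 1 on the first step illustrates how weights can be taken entirely from sampled gradients, which is one plausible reading of feature (b).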