ELO-Rated Sequence Rewards: Advancing Reinforcement Learning Models

Abstract

Reinforcement Learning (RL) heavily relies on the careful design of the reward function. However, accurately assigning rewards to each state-action pair in Long-Term Reinforcement Learning (LTRL) tasks remains a significant challenge. As a result, RL agents are often trained under expert guidance. Inspired by the ordinal utility theory in economics, we propose a novel reward estimation algorithm: ELO-Rating based Reinforcement Learning (ERRL). This approach features two key contributions. First, it uses expert preferences over trajectories rather than cardinal rewards (utilities) to compute the ELO rating of each trajectory as its reward. Second, a new reward redistribution algorithm is introduced to alleviate training instability in the absence of a fixed anchor reward. In long-term scenarios (up to 5000 steps), where traditional RL algorithms struggle, our method outperforms several state-of-the-art baselines. Additionally, we conduct a comprehensive analysis of how expert preferences influence the results.
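
To illustrate the first contribution, the following is a minimal sketch of how pairwise expert preferences over trajectories could be turned into per-trajectory ratings using the standard Elo update from chess. It is not the ERRL implementation: the function names, the K-factor of 32, the initial rating of 1500, and the 400-point scale are all assumptions chosen for illustration, and the reward redistribution step is not covered.

```python
from typing import List, Tuple


def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Standard Elo probability that the item rated r_a is preferred over r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))


def elo_ratings(
    num_trajectories: int,
    preferences: List[Tuple[int, int]],
    k: float = 32.0,          # assumed K-factor, not taken from the paper
    initial: float = 1500.0,  # assumed starting rating
    epochs: int = 1,
) -> List[float]:
    """Rate trajectories from ordinal preferences.

    Each preference is a (preferred_index, other_index) pair, meaning the
    expert ranked the first trajectory above the second.
    """
    ratings = [initial] * num_trajectories
    for _ in range(epochs):
        for winner, loser in preferences:
            # The update is larger when the outcome was less expected.
            delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
            ratings[winner] += delta
            ratings[loser] -= delta
    return ratings


if __name__ == "__main__":
    # Hypothetical expert judgments: trajectory 0 > 1, 1 > 2, and 0 > 2.
    prefs = [(0, 1), (1, 2), (0, 2)]
    for idx, rating in enumerate(elo_ratings(3, prefs)):
        print(f"trajectory {idx}: rating {rating:.1f}")
```

Because every update adds to one rating exactly what it subtracts from the other, the ratings are only meaningful relative to each other. This is one way to see why a method of this kind lacks a fixed anchor reward.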
