Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning

Abstract

Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals. Unfortunately, designing high-quality instance-level rewards often demands significant effort. An emerging alternative, RL with delayed reward, focuses on learning from rewards presented periodically, which can be obtained from human evaluators assessing the agent's performance over sequences of behaviors. However, traditional methods in this domain assume the existence of underlying Markovian rewards and that the observed delayed reward is simply the sum of instance-level rewards, both of which often do not align well with real-world scenarios. In this paper, we introduce the problem of RL from Composite Delayed Reward (RLCoDe), which generalizes traditional RL from delayed rewards by eliminating the strong assumption. We suggest that the delayed reward may arise from a more complex structure reflecting the overall contribution of the sequence. To address this problem, we present a framework for modeling composite delayed rewards, using a weighted sum of non-Markovian components to capture the different contributions of individual steps. Building on this framework, we propose Composite Delayed Reward Transformer (CoDeTr), which incorporates a specialized in-sequence attention mechanism to effectively model these contributions. We conduct experiments on challenging locomotion tasks where the agent receives delayed rewards computed from composite functions of observable step rewards. The experimental results indicate that CoDeTr consistently outperforms baseline methods across evaluated metrics. Additionally, we demonstrate that it effectively identifies the most significant time steps within the sequence and accurately predicts rewards that closely reflect the environment feedback.
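To make the contrast between the sum-form assumption and a composite delayed reward concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the CoDeTr architecture: the function names, the per-step `scores`, and the choice of softmax weighting are all hypothetical and introduced only for this example. In the sketch, each weight depends on every score in the sequence through the normalization, which is one simple way for a step's contribution to depend on more than that step alone.

```python
import math


def sum_delayed_reward(step_rewards):
    """Traditional assumption: the delayed reward is the plain sum of step rewards."""
    return sum(step_rewards)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def composite_delayed_reward(step_rewards, scores):
    """Hypothetical composite form: a weighted sum of step rewards.

    The weights come from a softmax over per-step scores (an assumption made
    for this sketch), so each step's weight depends on the whole sequence.
    """
    weights = softmax(scores)
    return sum(w * r for w, r in zip(weights, step_rewards))


step_rewards = [1.0, 0.0, 4.0]

# Equal scores give uniform weights, i.e. the mean of the step rewards.
uniform = composite_delayed_reward(step_rewards, [0.0, 0.0, 0.0])

# A high score on the last step makes that step dominate the delayed reward.
peaked = composite_delayed_reward(step_rewards, [0.0, 0.0, 5.0])
```

With equal scores the composite reward reduces to the average of the step rewards, while a peaked score vector shifts almost all of the weight onto a single step. A learned model would have to infer such weights from the observed delayed reward rather than receive them as input.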
