Reinforcement Learning with Segment Feedback

Abstract

Standard reinforcement learning (RL) assumes that an agent can observe a reward for each state-action pair. However, in practical applications, it is often difficult and costly to collect a reward for each state-action pair. While there have been several works considering RL with trajectory feedback, it is unclear if trajectory feedback is inefficient for learning when trajectories are long. In this work, we consider a model named RL with segment feedback, which offers a general paradigm filling the gap between per-state-action feedback and trajectory feedback. In this model, we consider an episodic Markov decision process (MDP), where each episode is divided into $m$ segments, and the agent observes reward feedback only at the end of each segment. Under this model, we study two popular feedback settings: binary feedback and sum feedback, where the agent observes a binary outcome and a reward sum according to the underlying reward function, respectively. To investigate the impact of the number of segments $m$ on learning performance, we design efficient algorithms and establish regret upper and lower bounds for both feedback settings. Our theoretical and experimental results show that: under binary feedback, increasing the number of segments $m$ decreases the regret at an exponential rate; in contrast, surprisingly, under sum feedback, increasing $m$ does not reduce the regret significantly.
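
To make the two feedback settings concrete, the following is a minimal Python sketch of what an agent might observe in one episode. It is an illustration under stated assumptions, not the paper's algorithm: the function names `segment_sums` and `binary_feedback` are hypothetical, the episode length is assumed to be divisible by $m$, and the binary outcome is assumed to be drawn through a sigmoid of the segment's reward sum, a link function the abstract does not specify.

```python
import math
import random


def segment_sums(rewards, m):
    """Sum feedback: one reward sum per segment.

    `rewards` holds the (unobserved) per-step rewards of one episode.
    Assumption: the episode length is divisible by m, so that all
    segments have equal length.
    """
    horizon = len(rewards)
    if m <= 0 or horizon % m != 0:
        raise ValueError("episode length must be a positive multiple of m")
    seg_len = horizon // m
    return [sum(rewards[i * seg_len:(i + 1) * seg_len]) for i in range(m)]


def binary_feedback(rewards, m, rng):
    """Binary feedback: one 0/1 outcome per segment.

    Assumption (not stated in the abstract): the outcome is 1 with
    probability sigmoid(segment reward sum).
    """
    outcomes = []
    for s in segment_sums(rewards, m):
        p = 1.0 / (1.0 + math.exp(-s))
        outcomes.append(1 if rng.random() < p else 0)
    return outcomes


# Hypothetical episode of 6 steps with integer rewards.
episode = [1, 0, 1, 1, 0, 0]
per_step = segment_sums(episode, 6)    # m = H: per-state-action feedback
per_segment = segment_sums(episode, 3)  # m = 3: segment feedback
per_episode = segment_sums(episode, 1)  # m = 1: trajectory feedback
noisy = binary_feedback(episode, 3, random.Random(0))
```

Setting `m` equal to the episode length recovers per-state-action feedback and `m = 1` recovers trajectory feedback, which is the sense in which segment feedback interpolates between the two.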
