On the Theory of Reinforcement Learning with Once-per-Episode Feedback

Abstract

We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self-driving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was either "good" or "bad," but harder to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sub-linear regret.
