Abstract
Reinforcement learning has been successful across several applications in which agents have to learn to act in environments with sparse feedback. However, despite this empirical success, there is still a lack of theoretical understanding of how the parameters of reinforcement learning models and the features used to represent states interact to control the dynamics of learning. In this work, we use concepts from statistical physics to study the typical-case learning curves for temporal difference learning of a value function with linear function approximators. Our theory is derived under a Gaussian equivalence hypothesis, where averages over the random trajectories are replaced with temporally correlated Gaussian feature averages, and we validate our assumptions on small-scale Markov decision processes. We find that the stochastic semi-gradient noise due to subsampling the space of possible episodes leads to significant plateaus in the value error, unlike in traditional gradient descent dynamics. We study how learning dynamics and plateaus depend on feature structure, learning rate, discount factor, and reward function. We then analyze how strategies like learning rate annealing and reward shaping can favorably alter learning dynamics and plateaus. To conclude, our work introduces new tools that open a new direction towards developing a theory of learning dynamics in reinforcement learning.