Abstract
Reinforcement learning algorithms typically assume rewards to be sampled fromlight-tailed distributions, such as Gaussian or bounded. However, a widevariety of real-world systems generate rewards that follow heavy-taileddistributions. We consider such scenarios in the setting of undiscountedreinforcement learning. By constructing a lower bound, we show that thedifficulty of learning heavy-tailed rewards asymptotically dominates thedifficulty of learning transition probabilities. Leveraging techniques fromrobust mean estimation, we propose Heavy-UCRL2 and Heavy-Q-Learning, and showthat they achieve near-optimal regret bounds in this setting. Our algorithmsalso naturally generalize to deep reinforcement learning applications; weinstantiate Heavy-DQN as an example of this. We demonstrate that all of ouralgorithms outperform baselines on both synthetic MDPs and standard RLbenchmarks.