Sequence Compression Speeds Up Credit Assignment in Reinforcement Learning

Abstract

Temporal credit assignment in reinforcement learning is challenging due to delayed and stochastic outcomes. Monte Carlo targets can bridge long delays between action and consequence but lead to high-variance targets due to stochasticity. Temporal difference (TD) learning uses bootstrapping to overcome variance but introduces a bias that can only be corrected through many iterations. TD($\lambda$) provides a mechanism to navigate this bias-variance tradeoff smoothly. Appropriately selecting $\lambda$ can significantly improve performance. Here, we propose Chunked-TD, which uses predicted probabilities of transitions from a model for computing $\lambda$-return targets. Unlike other model-based solutions to credit assignment, Chunked-TD is less vulnerable to model inaccuracies. Our approach is motivated by the principle of history compression and 'chunks' trajectories for conventional TD learning. Chunking with learned world models compresses near-deterministic regions of the environment-policy interaction to speed up credit assignment while still bootstrapping when necessary. We propose algorithms that can be implemented online and show that they solve some problems much faster than conventional TD($\lambda$).
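
To make the idea of a $\lambda$-return with a per-transition $\lambda$ concrete, the following is a minimal sketch and not the paper's algorithm. It assumes an offline, episodic setting and assumes that $\lambda$ for each step is set directly to the model's predicted probability of the observed transition. The function name `lambda_returns` and its argument layout are hypothetical and chosen for illustration only.

```python
def lambda_returns(rewards, values, lambdas, gamma=1.0):
    """Backward recursion for lambda-returns with a separate lambda per step.

    rewards[t] : reward received on the transition s_t -> s_{t+1}
    values[t]  : current estimate V(s_t), for t = 0..T (use 0.0 at a terminal state)
    lambdas[t] : lambda used for the transition s_t -> s_{t+1}; ASSUMPTION: this is
                 the model's predicted probability of that observed transition
    """
    T = len(rewards)
    returns = [0.0] * T
    next_return = values[T]  # bootstrap from the final state
    for t in reversed(range(T)):
        lam = lambdas[t]
        # Near-deterministic step (lam close to 1): pass the later return through.
        # Surprising step (lam close to 0): bootstrap from V(s_{t+1}) instead.
        target = (1.0 - lam) * values[t + 1] + lam * next_return
        returns[t] = rewards[t] + gamma * target
        next_return = returns[t]
    return returns


rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.2, 0.4, 0.0]

monte_carlo = lambda_returns(rewards, values, [1.0, 1.0, 1.0])
td_zero = lambda_returns(rewards, values, [0.0, 0.0, 0.0])
chunked = lambda_returns(rewards, values, [1.0, 0.5, 1.0])
```

With every $\lambda$ equal to 1 the targets reduce to Monte Carlo returns, and with every $\lambda$ equal to 0 they reduce to one-step TD targets. Intermediate, transition-dependent values interpolate between the two along the trajectory.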
