Abstract
Reinforcement learning (RL) with delays is challenging because sensory perceptions lag behind the actual events: the RL agent must estimate the real state of its environment from past observations. State-of-the-art (SOTA) methods typically forecast states recursively, step by step, which can cause compounding errors to accumulate. To tackle this problem, our novel belief estimation method, named Directly Forecasting Belief Transformer (DFBT), directly forecasts states from observations without incrementally estimating intermediate states. We theoretically demonstrate that DFBT greatly reduces the compounding errors of existing recursive forecasting methods, yielding stronger performance guarantees. In experiments with D4RL offline datasets, DFBT reduces compounding errors with remarkable prediction accuracy. DFBT's capability to forecast state sequences also facilitates multi-step bootstrapping, thus greatly improving learning efficiency. On the MuJoCo benchmark, our DFBT-based method substantially outperforms SOTA baselines.