Abstract
We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real-world without adaptation despite being trained purely in simulation. PoliFormer uses a foundational vision transformer encoder with a causal transformer decoder enabling long-term memory and reasoning. It is trained for hundreds of millions of interactions across diverse environments, leveraging parallelized, multi-machine rollouts for efficient training with high throughput. PoliFormer is a masterful navigator, producing state-of-the-art results across two distinct embodiments, the LoCoBot and Stretch RE-1 robots, and four navigation benchmarks. It breaks through the plateaus of previous work, achieving an unprecedented 85.5% success rate in object goal navigation on the CHORES-S benchmark, a 28.5% absolute improvement. PoliFormer can also be trivially extended to a variety of downstream applications such as object tracking, multi-object navigation, and open-vocabulary navigation with no finetuning.