Refined Policy Distillation: From VLA Generalists to RL Experts

Abstract

Vision-Language-Action Models (VLAs) have demonstrated remarkable generalization capabilities in real-world experiments. However, their success rates are often not on par with expert policies, and they require fine-tuning when the setup changes. In this work, we introduce Refined Policy Distillation (RPD), a novel Reinforcement Learning (RL)-based policy refinement method that bridges this performance gap by combining on-policy RL with behavioral cloning. The core idea of RPD is to distill and refine VLAs into compact, high-performing expert policies by guiding the student policy during RL exploration using the actions of a teacher VLA, resulting in increased sample efficiency and faster convergence. We complement our method with fine-tuned versions of Octo and OpenVLA for ManiSkill3 to evaluate RPD in simulation. While simulation is a key requirement for applying RL, it also yields new insights beyond existing studies on VLA performance in real-world settings. Our experimental results across various manipulation tasks show that RPD enables the RL student to learn expert policies that outperform the VLA teacher in both dense and sparse reward settings, while also achieving faster convergence than the RL baseline. Our approach is even robust to changes in camera perspective and can generalize to task variations that the underlying VLA cannot solve. Our code, dataset, VLA checkpoints, and videos are available at https://refined-policy-distillation.github.io
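
The combination of on-policy RL with behavioral cloning described above can be illustrated with a small sketch. The abstract does not state the exact objective, so everything here is an assumption made for illustration: a PPO-style clipped surrogate stands in for the on-policy RL term, a mean squared error between student and teacher actions stands in for the behavioral cloning term, and the function names and the `bc_weight` parameter are hypothetical.

```python
import math


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, returned as a loss to minimize.

    Assumption: the abstract only says "on-policy RL"; PPO is used here
    purely as a familiar example of such a method.
    """
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # probability ratio pi_new / pi_old
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return -sum(terms) / len(terms)


def bc_loss(student_actions, teacher_actions):
    """Mean squared error between student and teacher actions.

    Assumption: MSE is one common behavioral cloning loss for continuous
    actions; the abstract does not specify the form of the guidance term.
    """
    squared_errors = [
        (s - t) ** 2
        for student, teacher in zip(student_actions, teacher_actions)
        for s, t in zip(student, teacher)
    ]
    return sum(squared_errors) / len(squared_errors)


def guided_loss(logp_new, logp_old, advantages,
                student_actions, teacher_actions, bc_weight=1.0):
    """RL loss plus a weighted teacher-imitation term (hypothetical weighting)."""
    rl_term = ppo_clip_loss(logp_new, logp_old, advantages)
    bc_term = bc_loss(student_actions, teacher_actions)
    return rl_term + bc_weight * bc_term


if __name__ == "__main__":
    # Toy batch of two transitions with 2-D actions; the values are made up.
    loss = guided_loss(
        logp_new=[-1.0, -1.0],
        logp_old=[-1.0, -1.0],
        advantages=[1.0, -1.0],
        student_actions=[[0.0, 0.0], [0.0, 0.0]],
        teacher_actions=[[1.0, 1.0], [1.0, 1.0]],
        bc_weight=0.5,
    )
    print(loss)
```

In a sketch like this, the imitation term pulls the student toward the teacher's actions on the states the student itself visits, while the RL term remains free to improve on the teacher wherever the reward signal disagrees with it.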
