Abstract
Executing language-conditioned tasks in dynamic visual environments remains a central challenge in embodied AI. Existing Vision-Language-Action (VLA) models predominantly adopt reactive state-to-action mappings, often leading to short-sighted behaviors and poor robustness in dynamic scenes. In this paper, we introduce F1, a pretrained VLA framework that integrates visual foresight generation into the decision-making pipeline. F1 adopts a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control, thereby bridging understanding, generation, and action. At its core, F1 employs a next-scale prediction mechanism to synthesize goal-conditioned visual foresight as explicit planning targets. By forecasting plausible future visual states, F1 reformulates action generation as a foresight-guided inverse dynamics problem, enabling actions that implicitly achieve visual goals. To endow F1 with robust and generalizable capabilities, we propose a three-stage training recipe on an extensive dataset comprising over 330k trajectories across 136 diverse tasks. This training scheme enhances modular reasoning and equips the model with transferable visual foresight, which is critical for complex and dynamic environments. Extensive evaluations on real-world tasks and simulation benchmarks demonstrate that F1 consistently outperforms existing approaches, achieving substantial gains in both task success rate and generalization ability.