OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction

Abstract

Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen. Thereby, OTTER preserves and utilizes the rich semantic understanding learned from large-scale pre-training, enabling strong zero-shot generalization capabilities. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Video, code, checkpoints, and dataset: https://ottervla.github.io/.
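
The abstract does not spell out how the text-aware extraction is computed, so the following is only a minimal sketch of one plausible reading: each text-token embedding attends over the visual patch embeddings by cosine similarity, and the resulting weights pool the patch features into one task-relevant vector per text token. The function names (`l2_normalize`, `softmax`, `text_aware_pool`), the `temperature` parameter, and the toy 2-D embeddings are all illustrative assumptions, not OTTER's released implementation. In practice the embeddings would come from the frozen vision-language encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_aware_pool(patch_feats, text_feats, temperature=0.1):
    """Hypothetical text-aware pooling (an assumption, not the paper's code).

    For each text-token embedding, weight the visual patch features by the
    softmax of their cosine similarity to that token, then average them.
    Returns one pooled visual vector per text token.
    """
    patches_n = [l2_normalize(p) for p in patch_feats]
    dim = len(patch_feats[0])
    pooled = []
    for t in text_feats:
        t_n = l2_normalize(t)
        # Cosine similarity between this text token and every patch.
        sims = [sum(a * b for a, b in zip(t_n, p)) / temperature
                for p in patches_n]
        weights = softmax(sims)
        # Similarity-weighted average of the (unnormalized) patch features.
        pooled.append([sum(w * p[d] for w, p in zip(weights, patch_feats))
                       for d in range(dim)])
    return pooled


# Toy example: three patches, one text token aligned with the first patch.
patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
text = [[1.0, 0.0]]
pooled = text_aware_pool(patches, text)
```

In this toy case the pooled vector is dominated by the first patch, since it is the only one aligned with the text token. Under this reading, only the pooled vectors would be passed to the policy transformer, so the encoders that produce `patch_feats` and `text_feats` can stay frozen.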
