MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Abstract

Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On SimplerEnv-Bridge, Fractal, and LIBERO-5 suites, it achieves 71.9%, 72.7%, and 96.5% success rates, respectively, all outperforming state-of-the-art baselines CogACT and pi-0, with a notable +14.6 gain on Bridge. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over the state-of-the-art baseline. Project Page: https://shihao1895.github.io/MemoryVLA
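
The retrieve, fuse, and consolidate loop described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the retrieval, fusion, or merging rules, so the sketch assumes softmax attention for retrieval, a sigmoid gate for fusion, and averaging of the most similar adjacent pair of entries for consolidation. All names here (MemoryBank, gate_fuse, w_gate) are hypothetical, and random vectors stand in for the tokens a VLM would produce.

```python
import numpy as np


class MemoryBank:
    """Toy fixed-capacity memory bank (illustrative assumption, not the paper's code)."""

    def __init__(self, dim, capacity):
        self.dim = dim
        self.capacity = capacity
        self.entries = np.empty((0, dim))

    def retrieve(self, query):
        """Softmax attention of a single query vector over the stored entries."""
        if len(self.entries) == 0:
            return np.zeros(self.dim)
        scores = self.entries @ query / np.sqrt(self.dim)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.entries

    def update(self, token):
        """Append a token; when over capacity, average the most similar adjacent pair."""
        self.entries = np.vstack([self.entries, token[None, :]])
        if len(self.entries) > self.capacity:
            a, b = self.entries[:-1], self.entries[1:]
            norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
            sims = np.sum(a * b, axis=1) / norms  # cosine similarity of neighbours
            i = int(np.argmax(sims))
            merged = 0.5 * (self.entries[i] + self.entries[i + 1])
            self.entries = np.vstack(
                [self.entries[:i], merged[None, :], self.entries[i + 2:]]
            )


def gate_fuse(current, retrieved, w_gate):
    """Sigmoid-gated blend; w_gate is a hypothetical (2 * dim, dim) weight matrix."""
    logits = np.concatenate([current, retrieved]) @ w_gate
    gate = 1.0 / (1.0 + np.exp(-logits))
    return gate * retrieved + (1.0 - gate) * current


# Usage: feed ten random "working memory" tokens through the loop.
rng = np.random.default_rng(0)
dim = 8
bank = MemoryBank(dim=dim, capacity=4)
w_gate = rng.normal(scale=0.1, size=(2 * dim, dim))
for _ in range(10):
    token = rng.normal(size=dim)
    fused = gate_fuse(token, bank.retrieve(token), w_gate)  # memory-conditioned token
    bank.update(token)
```

In this sketch the bank never grows past its capacity, because each overflow collapses one pair of neighbouring entries into their mean. In the framework described above, the fused tokens would condition the diffusion action expert; that stage is omitted here.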
