Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models

Abstract

Exploration is essential for general-purpose robotic learning, especially in open-ended environments where dense rewards, explicit goals, or task-specific supervision are scarce. Vision-language models (VLMs), with their semantic reasoning over objects, spatial relations, and potential outcomes, present a compelling foundation for generating high-level exploratory behaviors. However, their outputs are often ungrounded, making it difficult to determine whether imagined transitions are physically feasible or informative. To bridge the gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. Human exploration is often driven by the desire to discover novel scene configurations and to deepen understanding of the environment. Similarly, IVE leverages VLMs to abstract RGB-D observations into semantic scene graphs, imagine novel scenes, predict their physical plausibility, and generate executable skill sequences through action tools. We evaluate IVE in both simulated and real-world tabletop environments. The results show that IVE enables more diverse and meaningful exploration than RL baselines, as evidenced by a 4.1 to 7.8x increase in the entropy of visited states. Moreover, the collected experience supports downstream learning, producing policies that closely match or exceed the performance of those trained on human-collected demonstrations.
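
The abstract reports exploration diversity as the entropy of visited states but does not define the metric here. As a minimal sketch, assuming each visited state can be reduced to a hashable identifier (for example, a canonical string for a scene graph) and that the metric is the empirical Shannon entropy of the visit distribution, it could be computed as follows. The function name `visited_state_entropy` and the state labels are hypothetical, not taken from the paper.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def visited_state_entropy(states: Iterable[Hashable]) -> float:
    """Empirical Shannon entropy (in bits) of a sequence of visited states.

    Assumption: states are discrete, hashable identifiers. The paper's own
    state abstraction and entropy definition may differ.
    """
    counts = Counter(states)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no visits, no diversity
    # H = -sum_s p(s) * log2 p(s), with p(s) the empirical visit frequency
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical trajectories: one agent revisits a single configuration,
# another spreads its visits evenly over four configurations.
repetitive = ["A", "A", "A", "A"]
diverse = ["A", "B", "C", "D"]
print(visited_state_entropy(repetitive), visited_state_entropy(diverse))
```

Under these assumptions, an agent that keeps returning to the same configuration scores zero, while one that spreads its visits evenly over more configurations scores higher, which is the sense in which a higher entropy indicates more diverse exploration.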
