Abstract
Large vision-language models (VLMs) have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, deploying these models for reasoning in embodied contexts is constrained by their limited ability to incorporate long-term experience collected across multiple days and represented by vast collections of images. Current VLMs typically struggle to process more than a few hundred images concurrently, highlighting the need for more efficient mechanisms to handle long-term memory in embodied settings. To effectively evaluate these models for long-horizon control, a benchmark must specifically target scenarios where memory is crucial for success. Existing long-video QA benchmarks overlook embodied challenges like object manipulation and navigation, which demand low-level skills and fine-grained reasoning over past interactions. Moreover, effective memory integration in embodied agents involves both recalling relevant historical information and executing actions based on that information, making it essential to study these aspects together rather than in isolation. In this work, we introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness in an environment. The tasks can also be procedurally extended to longer and more challenging versions, enabling scalable evaluation of memory and reasoning. We also present baselines that integrate state-of-the-art VLMs with low-level navigation policies, assessing their performance on these memory-intensive tasks and highlighting areas for improvement.