Abstract
Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. In contrast, current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments. We posit that part of this limitation is due to the lack of proper 3D spatial-temporal memory modeling in LLMs. To address this, we first introduce 3DMem-Bench, a comprehensive benchmark comprising over 26,000 trajectories and 2,892 embodied tasks, question-answering and captioning, designed to evaluate an agent's ability to reason over long-term memory in 3D environments. Second, we propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions in LLMs. Our model uses working memory tokens, which represent current observations, as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory, which stores past observations and interactions. Our approach allows the agent to focus on task-relevant information while maintaining memory efficiency in complex, long-horizon environments. Experimental results demonstrate that 3DLLM-Mem achieves state-of-the-art performance across various tasks, outperforming the strongest baselines by 16.5% in success rate on 3DMem-Bench's most challenging in-the-wild embodied tasks.
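The query-based fusion described above can be illustrated with a minimal sketch. It is not the 3DLLM-Mem implementation: it assumes plain scaled dot-product attention with no learned projections, and it assumes the retrieved feature is added back to the query as a residual. The names `fuse_memory`, `working_tokens`, and `episodic_memory` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_memory(working_tokens, episodic_memory):
    """Let each working-memory token query episodic memory.

    Assumption: scaled dot-product attention without learned projections,
    followed by a residual addition. Both choices are illustrative only.
    Returns the fused tokens and the attention weights per query.
    """
    dim = len(working_tokens[0])
    fused, all_weights = [], []
    for query in working_tokens:
        # Similarity between the current observation and each stored memory.
        scores = [
            sum(q * k for q, k in zip(query, mem)) / math.sqrt(dim)
            for mem in episodic_memory
        ]
        weights = softmax(scores)
        # Weighted sum of memory features: the most similar entries dominate.
        retrieved = [
            sum(w * mem[i] for w, mem in zip(weights, episodic_memory))
            for i in range(dim)
        ]
        # Residual fusion keeps the current observation in the result.
        fused.append([q + r for q, r in zip(query, retrieved)])
        all_weights.append(weights)
    return fused, all_weights


# One working-memory token and two episodic-memory entries (toy values).
working_tokens = [[1.0, 0.0]]
episodic_memory = [[5.0, 0.0], [0.0, 5.0]]
fused, weights = fuse_memory(working_tokens, episodic_memory)
```

In this toy example the query aligns with the first memory entry, so that entry receives most of the attention weight and dominates the fused feature. This is the selective behaviour the abstract describes, shown here without any learned parameters.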