LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Abstract

Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. This paper introduces LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems, with commercial chat assistants and long-context LLMs showing a 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down the long-term memory design into four design choices across the indexing, retrieval, and reading stages. Built upon key experimental insights, we propose several memory designs including session decomposition for optimizing value granularity, fact-augmented key expansion for enhancing the index structure, and time-aware query expansion for refining the search scope. Experimental results show that these optimizations greatly improve both memory recall and downstream question answering on LongMemEval. Overall, our study provides valuable resources and guidance for advancing the long-term memory capabilities of LLM-based chat assistants, paving the way toward more personalized and reliable conversational AI.
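
To make the three memory designs named in the abstract more concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes that session decomposition means storing one memory item per conversation round, that fact-augmented key expansion means appending extracted facts to the searchable key, and that time-aware query expansion can be approximated by filtering on a date range before ranking. The names `MemoryItem`, `index_session`, and `retrieve`, the hand-written facts, and the word-overlap scorer are all hypothetical stand-ins (a real system would likely use an LLM to extract facts and a dense or sparse retriever to rank).

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class MemoryItem:
    key: str         # text that is searched at retrieval time
    value: str       # text that is handed to the reader model
    timestamp: date  # date of the session the item came from


def index_session(rounds, facts_per_round, timestamp):
    """Store one item per round; the key is the round text plus its facts."""
    items = []
    for round_text, facts in zip(rounds, facts_per_round):
        key = " ".join([round_text, *facts])  # assumed form of key expansion
        items.append(MemoryItem(key=key, value=round_text, timestamp=timestamp))
    return items


def retrieve(memory, query, start=None, end=None, top_k=2):
    """Filter items by date range, then rank by word overlap (toy scorer)."""
    query_words = set(query.lower().split())
    candidates = [
        item for item in memory
        if (start is None or item.timestamp >= start)
        and (end is None or item.timestamp <= end)
    ]

    def overlap(item):
        return len(query_words & set(item.key.lower().split()))

    return sorted(candidates, key=overlap, reverse=True)[:top_k]


# Hypothetical chat history; the facts are written by hand for illustration.
memory = index_session(
    ["I just adopted a beagle named Max.", "Can you suggest a pasta recipe?"],
    [["user owns a dog"], ["user likes pasta"]],
    date(2023, 5, 1),
)
memory += index_session(
    ["My dog needs a new leash."],
    [["user owns a dog"]],
    date(2023, 6, 10),
)

# Restrict the search to April and May 2023, then take the best match.
hits = retrieve(
    memory,
    "what dog does the user own",
    start=date(2023, 4, 1),
    end=date(2023, 5, 31),
    top_k=1,
)
print(hits[0].value)  # I just adopted a beagle named Max.
```

In this toy example the raw round text about the beagle shares no words with the query, so it is matched only through the appended fact; the June session is excluded by the date range before ranking.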
