Abstract
Large Language Models (LLMs) are increasingly used as chatbots, yet their ability to personalize responses to user preferences remains limited. We introduce PrefEval, a benchmark for evaluating LLMs' ability to infer, memorize and adhere to user preferences in a long-context conversational setting. PrefEval comprises 3,000 manually curated user preference and query pairs spanning 20 topics. PrefEval contains user personalization or preference information in both explicit and implicit forms, and evaluates LLM performance using a generation and a classification task. With PrefEval, we evaluated the aforementioned preference following capabilities of 10 open-source and proprietary LLMs in multi-session conversations with varying context lengths up to 100k tokens. We benchmark with various prompting, iterative feedback, and retrieval-augmented generation methods. Our benchmarking effort reveals that state-of-the-art LLMs face significant challenges in proactively following users' preferences during conversations. In particular, in zero-shot settings, preference following accuracy falls below 10% at merely 10 turns (~3k tokens) across most evaluated models. Even with advanced prompting and retrieval methods, preference following still deteriorates in long-context conversations. Furthermore, we show that fine-tuning on PrefEval significantly improves performance. We believe PrefEval serves as a valuable resource for measuring, understanding, and enhancing LLMs' preference following abilities, paving the way for personalized conversational agents. Our code and dataset are available at https://prefeval.github.io/.