Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory

  • 2025-04-10 18:57:33
  • Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, James Zou

Abstract

Despite their impressive performance on complex tasks, current language models (LMs) typically operate in a vacuum: each input query is processed separately, without retaining insights from previous attempts. Here, we present Dynamic Cheatsheet (DC), a lightweight framework that endows a black-box LM with a persistent, evolving memory. Rather than repeatedly re-discovering or re-committing the same solutions and mistakes, DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time. This test-time learning enhances performance substantially across a range of tasks without needing explicit ground-truth labels or human feedback. Leveraging DC, Claude 3.5 Sonnet's accuracy more than doubled on AIME math exams once it began retaining algebraic insights across questions. Similarly, GPT-4o's success rate on Game of 24 increased from 10% to 99% after the model discovered and reused a Python-based solution. In tasks prone to arithmetic mistakes, such as balancing equations, DC enabled GPT-4o and Claude to reach near-perfect accuracy by recalling previously validated code, whereas their baselines stagnated around 50%. Beyond arithmetic challenges, DC yields notable accuracy gains on knowledge-demanding tasks. Claude achieved a 9% improvement in GPQA-Diamond and an 8% boost on MMLU-Pro problems. Crucially, DC's memory is self-curated, focusing on concise, transferable snippets rather than entire transcripts. Unlike finetuning or static retrieval methods, DC adapts LMs' problem-solving skills on the fly, without modifying their underlying parameters. Overall, our findings present DC as a promising approach for augmenting LMs with persistent memory, bridging the divide between isolated inference events and the cumulative, experience-driven learning characteristic of human cognition.
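
The loop the abstract describes (answer a query using the current memory, then let the model rewrite that memory before the next query) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `run_with_cheatsheet`, `generate`, and `curate` are hypothetical, and the two toy functions merely stand in for calls to a black-box LM.

```python
from typing import Callable, List, Tuple


def run_with_cheatsheet(
    queries: List[str],
    generate: Callable[[str, str], str],
    curate: Callable[[str, str, str], str],
) -> Tuple[List[str], str]:
    """Process queries in order, carrying a text memory from one to the next.

    `generate(query, memory)` and `curate(memory, query, answer)` are
    hypothetical stand-ins for black-box LM calls; no model weights change.
    """
    memory = ""  # starts empty; no ground-truth labels are used anywhere
    answers = []
    for query in queries:
        answer = generate(query, memory)         # answer conditioned on memory
        memory = curate(memory, query, answer)   # model rewrites its own memory
        answers.append(answer)
    return answers, memory


# Toy stand-ins (assumptions for illustration only, not a real LM).
def toy_generate(query: str, memory: str) -> str:
    # Reuse a stored strategy when one exists, otherwise solve from scratch.
    return "reused" if "use-python-solver" in memory else "from-scratch"


def toy_curate(memory: str, query: str, answer: str) -> str:
    # Keep the memory concise: store the strategy once, never the transcript.
    if "use-python-solver" in memory:
        return memory
    return memory + "use-python-solver\n"


answers, memory = run_with_cheatsheet(["q1", "q2", "q3"], toy_generate, toy_curate)
print(answers)  # ['from-scratch', 'reused', 'reused']
```

In this toy run the first query is solved from scratch and later queries reuse the stored strategy, which loosely mirrors the Game of 24 behavior reported above; the curation step stores a single short entry rather than the whole exchange.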

 
