Abstract
We present TextAtari, a benchmark for evaluating language agents on very long-horizon decision-making tasks spanning up to 100,000 steps. By translating the visual state representations of classic Atari games into rich textual descriptions, TextAtari creates a challenging test bed that bridges sequential decision-making with natural language processing. The benchmark includes nearly 100 distinct tasks with varying complexity, action spaces, and planning horizons, all rendered as text through an unsupervised representation learning framework (AtariARI). We evaluate three open-source large language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) across three agent frameworks (zero-shot, few-shot chain-of-thought, and reflection reasoning) to assess how different forms of prior knowledge affect performance on these long-horizon challenges. Four scenarios (Basic, Obscured, Manual Augmentation, and Reference-based) investigate the impact of semantic understanding, instruction comprehension, and expert demonstrations on agent decision-making. Our results reveal significant performance gaps between language agents and human players in extensive planning tasks, highlighting challenges in sequential reasoning, state tracking, and strategic planning across tens of thousands of steps. TextAtari provides standardized evaluation protocols, baseline implementations, and a framework for advancing research at the intersection of language models and planning. Our code is available at https://github.com/Lww007/Text-Atari-Agents.