Abstract
The rapid development of Large Language Models (LLMs) has led to great strides in model capabilities like reasoning and long-context understanding. However, as LLMs are able to process longer contexts, it becomes more challenging to evaluate whether they have acquired certain capabilities, since the length of text (e.g., 100K tokens) they can process far exceeds what humans can reliably assess in a reasonable amount of time. In this paper, we propose using complex synthetic tasks as a proxy evaluation method, and present S3Eval, a Synthetic, Scalable, Systematic evaluation suite for LLM evaluation. As a synthetic benchmark, S3Eval enables the creation of any number of evaluation examples that are theoretically invisible to LLMs, mitigating the test set contamination issue. The synthetic nature of S3Eval gives users full control over the dataset, allowing them to systematically probe LLM capabilities by scaling text length and varying task difficulty across diverse scenarios. The strong correlation between S3Eval performance and scores on real-world benchmarks like Big-Bench Hard (BBH) demonstrates the soundness of using S3Eval for the evaluation of LLMs. The in-depth analysis also uncovers additional insights, including a performance drop when the answer is sparsely distributed or located in the middle of the context, as well as some counter-intuitive trends in model performance.