MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs

Abstract

Recent large language models (LLMs) have demonstrated versatile capabilities in long-context scenarios. Although some recent benchmarks have been developed to evaluate the long-context capabilities of LLMs, there is a lack of benchmarks evaluating the mathematical reasoning abilities of LLMs over long contexts, which is crucial for LLMs' application in real-world scenarios. In this paper, we introduce MathHay, an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs. Unlike previous benchmarks such as Needle in a Haystack, which focus primarily on information retrieval within long texts, MathHay demands models with both information-seeking and complex mathematical reasoning abilities. We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing LLMs. Even the best-performing model, Gemini-1.5-Pro-002, still struggles with mathematical reasoning over long contexts, achieving only 51.26% accuracy at 128K tokens. This highlights the significant room for improvement on the MathHay benchmark.
