GraphOmni: A Comprehensive and Extendable Benchmark Framework for Large Language Models on Graph-theoretic Tasks

Abstract

This paper introduces GraphOmni, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs on graph-theoretic tasks articulated in natural language. GraphOmni encompasses diverse graph types, serialization formats, and prompting schemes, significantly exceeding prior efforts in both scope and depth. Through extensive systematic evaluation, we identify critical interactions among these dimensions, demonstrating their substantial impact on model performance. Our experiments reveal that state-of-the-art models like Claude-3.5 and o4-mini consistently outperform other models, yet even these leading models exhibit substantial room for improvement. Performance variability is evident depending on the specific combinations of factors we considered, underscoring the necessity of comprehensive evaluations across these interconnected dimensions. Additionally, we observe distinct impacts of serialization and prompting strategies between open-source and closed-source models, encouraging the development of tailored approaches. Motivated by the findings, we also propose a reinforcement learning-inspired framework that adaptively selects the optimal factors influencing LLM reasoning capabilities. This flexible and extendable benchmark not only deepens our understanding of LLM performance on structured tasks but also provides a robust foundation for advancing research in LLM-based graph reasoning. The code and datasets are available at https://github.com/GAI-Community/GraphOmni.
