Abstract
Uncovering hidden symbolic laws from time series data, an aspiration dating back to Kepler's discovery of the laws of planetary motion, remains a core challenge in scientific discovery and artificial intelligence. While Large Language Models (LLMs) show promise in structured reasoning tasks, their ability to infer interpretable, context-aligned symbolic structures from time series data is still underexplored. To systematically evaluate this capability, we introduce SymbolBench, a comprehensive benchmark designed to assess symbolic reasoning over real-world time series across three tasks: multivariate symbolic regression, Boolean network inference, and causal discovery. Unlike prior efforts limited to simple algebraic equations, SymbolBench spans a diverse set of symbolic forms with varying complexity. We further propose a unified framework that integrates LLMs with genetic programming to form a closed-loop symbolic reasoning system, where LLMs act as both predictors and evaluators. Our empirical results reveal key strengths and limitations of current models, highlighting the importance of combining domain knowledge, context alignment, and reasoning structure to improve LLMs in automated scientific discovery.