Abstract
The rapid advancement of Large Language Models (LLMs) highlights the urgent need for evolving evaluation methodologies that keep pace with improvements in language comprehension and information processing. However, traditional benchmarks, which are often static, fail to capture the continually changing information landscape, leading to a disparity between the perceived and actual effectiveness of LLMs in ever-changing real-world scenarios. Our study examines temporal generalization, which includes the ability to understand, predict, and generate text relevant to past, present, and future contexts, revealing significant temporal biases in LLMs. We propose an evaluation framework for dynamically generating benchmarks from recent real-world predictions. Experiments demonstrate that LLMs struggle with temporal generalization, with performance declining over time. These findings highlight the necessity for improved training and updating processes to enhance adaptability and reduce biases. Our code, dataset, and benchmark are available at https://github.com/FreedomIntelligence/FreshBench.