Abstract
Released Large Language Models (LLMs) are often paired with a claimed knowledge cutoff date, or the date at which training data was gathered. Such information is crucial for applications where the LLM must provide up-to-date information. However, this statement only scratches the surface: do all resources in the training data share the same knowledge cutoff date? Does the model's demonstrated knowledge for these subsets closely align with their cutoff dates? In this work, we define the notion of an effective cutoff. This is distinct from the cutoff reported by the LLM designer and applies separately to sub-resources and topics. We propose a simple approach to estimate effective cutoffs on the resource-level temporal alignment of an LLM by probing across versions of the data. Using this analysis, we find that effective cutoffs often differ from reported cutoffs. To understand the root cause of this observation, we conduct a direct large-scale analysis on open pre-training datasets. Our analysis reveals two reasons for these inconsistencies: (1) temporal biases of CommonCrawl data due to non-trivial amounts of old data in new dumps and (2) complications in LLM deduplication schemes involving semantic duplicates and lexical near-duplicates. Overall, our results show that knowledge cutoffs are not as simple as they have seemed and that care must be taken both by LLM dataset curators and by practitioners who seek to use information from these models.
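
The idea of probing across versions of the data can be illustrated with a minimal sketch. It assumes, beyond what the abstract states, that each dated version of a resource has already been scored with a per-document perplexity under the model, and that the effective cutoff is taken to be the version with the lowest average score. The function name, the version labels, and the toy numbers are hypothetical and serve only as an illustration.

```python
from statistics import mean


def estimate_effective_cutoff(scores_by_version):
    """Return the version label the model appears best aligned with.

    scores_by_version maps a sortable version label (e.g. "2022-03") to a
    list of per-document perplexities measured on that version of the
    resource. Assumption: lower average perplexity means better alignment.
    """
    # Average the scores of each version, skipping versions with no documents.
    averages = {
        version: mean(scores)
        for version, scores in scores_by_version.items()
        if scores
    }
    if not averages:
        raise ValueError("no scored documents in any version")
    # Iterate in sorted order so that ties resolve to the earliest version.
    return min(sorted(averages), key=averages.__getitem__)


# Toy, made-up perplexities for three monthly versions of one resource.
toy_scores = {
    "2021-12": [14.0, 16.0],
    "2022-03": [11.0, 12.0],
    "2022-06": [13.0, 15.0],
}
print(estimate_effective_cutoff(toy_scores))
```

Comparing the returned label with the reported cutoff gives a rough, resource-level view of the kind of mismatch the abstract describes; a real analysis would require many documents per version and a scoring procedure chosen with care.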