Abstract
Keeping large foundation models up to date on the latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large-scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-Redcaps. TiC-DataComp, our largest dataset, contains over 12.7B timestamped image-text pairs spanning 9 years (2014-2022). We first use our benchmarks to curate various dynamic evaluations to measure the temporal robustness of existing models. We show that OpenAI's CLIP (trained on data up to 2020) loses $\approx 8\%$ zero-shot accuracy on our curated retrieval task from 2021-2022 compared with more recently trained models in the OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach that continues training from the last checkpoint and replays old data reduces compute by $2.5\times$ when compared to the standard practice of retraining from scratch. Code is available at https://github.com/apple/ml-tic-clip.