Abstract
Time is rooted in applying language models to biomedical applications: models are trained on historical data and deployed on new or future data, which may differ from the training data. While a growing number of biomedical tasks employ state-of-the-art language models, very few studies have examined temporal effects on biomedical models when data shifts between development and deployment. This study fills the gap by statistically probing the relations between language model performance and data shifts across three biomedical tasks. We deploy diverse metrics to evaluate model performance, distance methods to measure data drift, and statistical methods to quantify temporal effects on biomedical language models. Our study shows that time matters for deploying biomedical language models, while the degree of performance degradation varies by biomedical task and statistical quantification approach. We believe this study can establish a solid benchmark for assessing temporal effects on deploying biomedical language models.