Abstract
Recent advances in self-supervised models for natural language, vision, and protein sequences have inspired the development of large genomic DNA language models (DNALMs). These models aim to learn generalizable representations of diverse DNA elements, potentially enabling various genomic prediction, interpretation and design tasks. Despite their potential, existing benchmarks do not adequately assess the capabilities of DNALMs on key downstream applications involving an important class of non-coding DNA elements critical for regulating gene activity. In this study, we introduce DART-Eval, a suite of representative benchmarks specifically focused on regulatory DNA to evaluate model performance across zero-shot, probed, and fine-tuned scenarios against contemporary ab initio models as baselines. Our benchmarks target biologically meaningful downstream tasks such as functional sequence feature discovery, predicting cell-type specific regulatory activity, and counterfactual prediction of the impacts of genetic variants. We find that current DNALMs exhibit inconsistent performance and do not offer compelling gains over alternative baseline models for most tasks, while requiring significantly more computational resources. We discuss potentially promising modeling, data curation, and evaluation strategies for the next generation of DNALMs. Our code is available at https://github.com/kundajelab/DART-Eval.