DNAHLM -- DNA sequence and Human Language mixed large language Model

Abstract

There are already many DNA large language models, but most of them still follow traditional uses, such as extracting sequence features for classification tasks. More innovative applications of large language models, such as prompt engineering, RAG, and zero-shot or few-shot prediction, remain challenging for DNA-based models. The key issue is that DNA models and human natural language models are entirely separate, whereas techniques like prompt engineering require natural language, which significantly limits the application of DNA large language models. This paper introduces a hybrid model trained on the GPT-2 network, combining DNA sequences and English text to explore the potential of using prompts and fine-tuning in DNA models. The model has demonstrated its effectiveness in DNA-related zero-shot prediction and multitask applications.
