DNAHLM -- DNA sequence and Human Language mixed large language Model

Abstract

Many DNA large language models already exist, but most of them still follow traditional uses, such as extracting sequence features for classification tasks. More innovative applications of large language models, such as prompt engineering, RAG, and zero-shot or few-shot prediction, remain challenging for DNA-based models. The key issue is that DNA models and human natural language models are entirely separate, while techniques like prompt engineering require natural language; this significantly limits the application of DNA large language models. This paper introduces a model pre-trained on the GPT-2 architecture using a combination of DNA sequences and English text, with a unified BPE tokenization method. We then convert classification and other downstream tasks into Alpaca-format instruction data, and perform instruction fine-tuning on this pre-trained model to create a fine-tuned model capable of handling multiple tasks. The model has demonstrated its effectiveness in DNA-related zero-shot prediction and multitask applications. This research provides a highly promising direction for building a unified DNA sequence task framework.
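
To make the conversion of a classification task into Alpaca-format instruction data concrete, here is a minimal sketch in Python. It uses the standard Alpaca record fields (`instruction`, `input`, `output`) and the usual Alpaca prompt template. The helper names, the instruction wording, the label strings, and the DNA sequence are hypothetical illustrations; they are not taken from the paper.

```python
import json

# Standard Alpaca prompt template for records that carry an input field.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def to_alpaca_record(sequence: str, label: str, instruction: str) -> dict:
    """Wrap one labeled DNA sequence as an Alpaca-style instruction record."""
    return {"instruction": instruction, "input": sequence, "output": label}


def build_prompt(record: dict) -> str:
    """Render the prompt text; the model is expected to generate the output."""
    return ALPACA_TEMPLATE.format(
        instruction=record["instruction"], input=record["input"]
    )


# Hypothetical example: the sequence, label, and instruction are made up.
record = to_alpaca_record(
    sequence="TATAAAAGGCGCGATCGATCG",
    label="promoter",
    instruction=(
        "Determine whether the following DNA sequence is a promoter "
        "or a non-promoter."
    ),
)
prompt = build_prompt(record)

print(json.dumps(record, indent=2))  # one line of the fine-tuning dataset
print(prompt + record["output"])     # full training text: prompt plus answer
```

During fine-tuning, the prompt and the expected output would typically be concatenated into one training text. At inference time only the prompt is supplied and the model generates the label, which is what allows a single model to serve several tasks by changing the instruction.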
