Can linguists better understand DNA?

Abstract

Multilingual transfer ability, which reflects how well a model fine-tuned on one source language can be applied to other languages, has been well studied in multilingual pre-trained models. However, whether such capability transfer exists between natural language and gene sequences/languages remains underexplored. This study addresses this gap by drawing inspiration from the sentence-pair classification task used for evaluating sentence similarity in natural language. We constructed two analogous tasks: DNA-pair classification (DNA sequence similarity) and DNA-protein-pair classification (gene coding determination). These tasks were designed to validate the transferability of capabilities from natural language to gene sequences. Even a small-scale pre-trained model like GPT-2-small, which was pre-trained on English, achieved an accuracy of 78% on the DNA-pair classification task after being fine-tuned on English sentence-pair classification data (XTREME PAWS-X). With a BERT model trained on multilingual text, the precision reached 89%. On the more complex DNA-protein-pair classification task, however, the model's output was barely distinguishable from random output. These experiments confirm that capability transfer from natural language to biological language is present. Building on this foundation, we also investigated the impact of model parameter scale and pre-training on this capability transfer. We provide recommendations for facilitating the transfer of capabilities from natural language to genetic language, as well as new approaches for conducting biological research based on this capability. This study offers an intriguing new perspective on exploring the relationship between natural language and genetic language.
