RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization

Abstract

This study addresses the challenge of extending Large Language Models (LLMs) to non-English languages using non-Roman scripts. We propose an approach that utilizes the romanized form of text as an interface for LLMs, hypothesizing that its frequent informal use and shared tokens with English enhance cross-lingual alignment. Our approach involves the continual pretraining of an English LLM like Llama 2 on romanized text of non-English, non-Roman script languages, followed by instruction tuning on romanized data. The results indicate that romanized text not only reduces token fertility by 2x-4x but also matches or outperforms native script representation across various NLU, NLG, and MT tasks. Moreover, the embeddings computed on romanized text exhibit closer alignment with their English translations than those from the native script. Our approach presents a promising direction for leveraging the power of English LLMs in languages traditionally underrepresented in NLP.
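
Token fertility, as used above, is commonly taken to mean the average number of tokens a tokenizer produces per word. The following is a minimal sketch of how it could be measured, not the authors' evaluation code. The names `byte_tokenize` and `fertility` are hypothetical, the Hindi sample sentence is illustrative, and the byte-level tokenizer is only a stand-in for the Llama 2 tokenizer, so the numbers it yields are not the paper's results.

```python
from typing import Callable, Iterable, List


def byte_tokenize(text: str) -> List[bytes]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte of each word."""
    return [bytes([b]) for word in text.split() for b in word.encode("utf-8")]


def fertility(texts: Iterable[str], tokenize: Callable[[str], list]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


# Illustrative sample: the same greeting in Devanagari and in romanized form.
native = ["नमस्ते दुनिया"]
romanized = ["namaste duniya"]

native_fertility = fertility(native, byte_tokenize)
romanized_fertility = fertility(romanized, byte_tokenize)
```

Under this stand-in, each Devanagari code point costs several bytes while each Latin letter costs one, which is one way to see why romanized input can need fewer tokens per word. Swapping `byte_tokenize` for a real subword tokenizer's encode function would give figures for that tokenizer instead.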
