ElChat: Adapting Chat Language Models Using Only Target Unlabeled Language Data

Abstract

Vocabulary expansion (VE) is the de-facto approach to language adaptation of large language models (LLMs) by adding new tokens and continuing pre-training on target data. While this is effective for base models trained on unlabeled data, it poses challenges for chat models trained to follow instructions through labeled conversation data. Directly adapting the latter with VE on target unlabeled data may result in forgetting chat abilities. While ideal, target chat data is often unavailable or costly to create for low-resource languages, and machine-translated alternatives are not always effective. To address this issue, previous work proposed using a base and chat model from the same family. This method first adapts the base LLM with VE on target unlabeled data and then converts it to a chat model by adding a chat vector (CV) derived from the weight difference between the source base and chat models. We propose ElChat, a new language adaptation method for chat LLMs that adapts a chat model directly on target unlabeled data, without a base model. It elicits chat abilities by injecting information from the source chat model. ElChat offers more robust and competitive target language and safety performance while achieving superior English, chat, and instruction-following abilities compared to CV.
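The chat vector baseline described above amounts to simple weight arithmetic, which can be sketched as follows. This is a minimal illustration, not the authors' implementation: weights are modeled as flat lists of floats keyed by parameter name, the function names `chat_vector` and `apply_chat_vector` are hypothetical, and skipping parameters whose size changed under vocabulary expansion (such as embeddings) is an assumption made here for simplicity.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def chat_vector(source_base: Weights, source_chat: Weights) -> Weights:
    """Compute the chat vector: source chat weights minus source base weights."""
    return {
        name: [c - b for c, b in zip(source_chat[name], source_base[name])]
        for name in source_base
    }


def apply_chat_vector(adapted_base: Weights, cv: Weights) -> Weights:
    """Add the chat vector to a base model adapted with VE on target data.

    Assumption: parameters missing from the chat vector, or whose size no
    longer matches (e.g. embeddings enlarged by vocabulary expansion), are
    copied through unchanged.
    """
    merged: Weights = {}
    for name, values in adapted_base.items():
        delta = cv.get(name)
        if delta is None or len(delta) != len(values):
            merged[name] = list(values)
        else:
            merged[name] = [v + d for v, d in zip(values, delta)]
    return merged
```

In this sketch, the adapted base model receives the chat-specific weight shift only where shapes still align; how expanded embedding rows should be handled is left open here.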
