Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models

Abstract

Model merging, such as model souping, is the practice of combining different models with the same architecture together without further training. In this work, we present a model merging methodology that addresses the difficulty of fine-tuning Large Language Models (LLMs) for target tasks in non-English languages, where task-specific data is often unavailable. We focus on mathematical reasoning and, without in-language math data, facilitate cross-lingual transfer by composing language and math capabilities. Starting from the same pretrained model, we fine-tune separate "experts" on math instruction data in English and on generic instruction data in the target language. We then replace the top and bottom transformer layers of the math expert directly with layers from the language expert, which consequently enhances math performance in the target language. The resulting merged models outperform the individual experts and other merging methods on the math benchmark, MGSM, by 10% across four major languages where math instruction data is scarce. In addition, this layer swapping is simple, inexpensive, and intuitive, as it is based on an interpretative analysis of the most important parameter changes during the fine-tuning of each expert. The ability to successfully re-compose LLMs for cross-lingual transfer in this manner opens up future possibilities to combine model expertise, create modular solutions, and transfer reasoning capabilities across languages, all post hoc.
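
The layer swap described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it treats each expert as a plain dictionary mapping parameter names to values, assumes transformer-block parameters carry a `layers.<index>.` segment in their names, and uses hypothetical arguments `n_bottom` and `n_top` for how many layers to take from the language expert. The abstract does not specify how many layers are swapped or how non-layer parameters (embeddings, final norm, output head) are handled; here they are simply kept from the math expert.

```python
import re

# Assumed naming convention: block parameters contain "layers.<index>.".
LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def layer_index(name):
    """Return the transformer layer index encoded in a parameter name, or None."""
    match = LAYER_RE.search(name)
    return int(match.group(1)) if match else None


def swap_layers(math_state, lang_state, num_layers, n_bottom, n_top):
    """Copy the math expert, then overwrite its bottom and top layers
    with the corresponding layers of the language expert.

    n_bottom and n_top are hypothetical knobs, not values from the paper.
    """
    if n_bottom < 0 or n_top < 0 or n_bottom + n_top > num_layers:
        raise ValueError("n_bottom + n_top must fit within num_layers")
    if math_state.keys() != lang_state.keys():
        raise ValueError("experts must share the same architecture")

    merged = dict(math_state)  # start from the math expert
    for name, value in lang_state.items():
        idx = layer_index(name)
        if idx is None:
            continue  # non-layer parameters stay with the math expert (assumption)
        if idx < n_bottom or idx >= num_layers - n_top:
            merged[name] = value  # take this layer from the language expert
    return merged


if __name__ == "__main__":
    # Toy 6-layer "models" whose values just record which expert they came from.
    names = [f"model.layers.{i}.mlp.weight" for i in range(6)]
    names.append("model.embed_tokens.weight")
    math_expert = {n: "math" for n in names}
    lang_expert = {n: "lang" for n in names}

    merged = swap_layers(math_expert, lang_expert, num_layers=6, n_bottom=2, n_top=1)
    for n in names:
        print(n, "<-", merged[n])
```

With real checkpoints, the same loop would run over the two experts' parameter dictionaries; because both experts are fine-tuned from the same pretrained model, their parameter names and shapes line up, so no further training or alignment step is needed for the swap itself.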
