LANGALIGN: Enhancing Non-English Language Models via Cross-Lingual Embedding Alignment

Abstract

While Large Language Models have gained attention, many service developers still rely on embedding-based models due to practical constraints. In such cases, the quality of fine-tuning data directly impacts performance, and English datasets are often used as seed data for training non-English models. In this study, we propose LANGALIGN, which enhances target language processing by aligning English embedding vectors with those of the target language at the interface between the language model and the task header. Experiments on Korean, Japanese, and Chinese demonstrate that LANGALIGN significantly improves performance across all three languages. Additionally, we show that LANGALIGN can be applied in reverse to convert target language data into a format that an English-based model can process.
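
The abstract does not say how the alignment is computed. As a rough illustration of the general idea only, the sketch below assumes the simplest possible stand-in: a linear map fit by least squares on paired English and target-language embeddings. This is not the authors' method; the function name `fit_alignment`, the synthetic data, and the dimensions are all hypothetical.

```python
import numpy as np


def fit_alignment(src, tgt):
    """Least-squares linear map W such that src @ W approximates tgt.

    Hypothetical helper for illustration; not taken from the paper.
    """
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W


rng = np.random.default_rng(0)
dim, n_pairs = 8, 200

# Synthetic stand-ins for embeddings of parallel English / target-language
# sentences (assumption: real embeddings would come from an encoder).
en = rng.normal(size=(n_pairs, dim))
true_map = rng.normal(size=(dim, dim))
tgt = en @ true_map + 0.01 * rng.normal(size=(n_pairs, dim))

W = fit_alignment(en, tgt)

# Relative distance between the two embedding sets before and after mapping.
err_before = np.linalg.norm(en - tgt) / np.linalg.norm(tgt)
err_after = np.linalg.norm(en @ W - tgt) / np.linalg.norm(tgt)
print(f"relative error before: {err_before:.3f}, after: {err_after:.3f}")
```

In this toy setting the mapped English vectors could then be passed to a task head trained on target-language embeddings. Fitting the map in the opposite direction (target to English) would correspond loosely to the reverse use mentioned in the abstract.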
