Abstract
Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages -- often limited to single-domain data, such as the Bible -- continue to perform poorly. To resolve these class imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. Through an extensive analysis, we show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2%, demonstrating its effectiveness in enhancing LID models.