UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages

Abstract

Large language models (LLMs) underperform on low-resource languages due to limited training data. We present a method to efficiently collect text data for low-resource languages from the entire Common Crawl corpus. Our approach, UnifiedCrawl, filters and extracts Common Crawl data using minimal compute resources, yielding monolingual datasets much larger than previously available sources. We demonstrate that leveraging this data to fine-tune multilingual LLMs via efficient adapter methods (QLoRA) significantly boosts performance on the low-resource language while minimizing VRAM usage. Our experiments show large improvements in language modeling perplexity and an increase in few-shot prompting scores. Our work and released source code provide an affordable approach to improving LLMs for low-resource languages using consumer hardware. Our source code is available at https://github.com/bethelmelesse/unifiedcrawl.
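
For readers unfamiliar with the perplexity metric mentioned above, the following is a minimal sketch of the standard definition: the exponential of the mean negative log-likelihood per token. It is not taken from the UnifiedCrawl code. The function name `perplexity` and the per-token probabilities are hypothetical values chosen for illustration, not results from the paper.

```python
import math


def perplexity(token_log_probs):
    """Return exp(-mean log-probability), using natural-log token scores."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical per-token log-probabilities (illustrative only, not paper results).
base_scores = [math.log(0.05)] * 10      # model assigns 5% to each correct token
adapted_scores = [math.log(0.25)] * 10   # model assigns 25% to each correct token

print("base:", perplexity(base_scores))
print("adapted:", perplexity(adapted_scores))
```

With these made-up inputs, the first model has a perplexity of about 20 and the second about 4. Lower perplexity means the model assigns higher probability to the held-out text.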
