Abstract
The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural language processing, LLMs have frequently been made accessible to the public to foster deeper investigation and applications. However, the training datasets for these LLMs, especially for recent state-of-the-art models, are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. The lack of transparency about training data has thus hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source, readily usable datasets for effectively training LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous multi-stage pipeline to achieve the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is fully released to the public on HuggingFace to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX.
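As a rough illustration of how the stages named above might chain together, the following is a minimal sketch and not the CulturaX pipeline itself. The blocklist, the thresholds, the document fields (`url`, `lang_confidence`, `text`), and all function names are hypothetical. Language identification is assumed to have been run upstream and stored as a confidence score. Deduplication here is exact-match hashing only, which is a simplification and may differ from the method used for the dataset.

```python
import hashlib
import re
from urllib.parse import urlparse

# Hypothetical blocklist and thresholds, chosen only for illustration.
BLOCKED_DOMAINS = {"spam.example"}
MIN_WORDS = 5
MAX_SYMBOL_RATIO = 0.3
MIN_LANG_CONFIDENCE = 0.5


def url_allowed(url):
    """URL-based filtering: reject documents from blocklisted domains."""
    host = urlparse(url).hostname or ""
    return host not in BLOCKED_DOMAINS


def passes_metrics(text):
    """Metric-based cleaning: word-count and symbol-ratio checks."""
    words = text.split()
    if len(words) < MIN_WORDS:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= MAX_SYMBOL_RATIO


def refine(text):
    """Document refinement: trim each line and drop empty lines."""
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def fingerprint(text):
    """Exact-duplicate key: hash of lowercased, whitespace-normalized text."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def clean_corpus(docs):
    """Apply the stages in order and return the surviving documents."""
    seen = set()
    kept = []
    for doc in docs:
        # Language identification (score assumed to be precomputed).
        if doc["lang_confidence"] < MIN_LANG_CONFIDENCE:
            continue
        if not url_allowed(doc["url"]):
            continue
        if not passes_metrics(doc["text"]):
            continue
        text = refine(doc["text"])
        key = fingerprint(text)
        if key in seen:  # data deduplication
            continue
        seen.add(key)
        kept.append({**doc, "text": text})
    return kept
```

Running the cheap filters (language score, URL) before the text-level checks keeps the more expensive hashing step for documents that have already survived the earlier stages.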