Gaperon: A Peppered English-French Generative Language Model Suite

Abstract

We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models trained on 2-4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Through this work, we study how data filtering and contamination interact to shape both benchmark and generative performance. We find that filtering for linguistic quality enhances text fluency and coherence but yields subpar benchmark results, and that late deliberate contamination -- continuing training on data mixes that include test sets -- recovers competitive scores while only reasonably harming generation quality. We discuss how usual neural filtering can unintentionally amplify benchmark leakage. To support further research, we also introduce harmless data poisoning during pretraining, providing a realistic testbed for safety studies. By openly releasing all models, datasets, code, and checkpoints, Gaperon establishes a reproducible foundation for exploring the trade-offs between data curation, evaluation, safety, and openness in multilingual language model development.
