Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

  • 2025-09-17 17:59:21
  • Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, Vinko Sabolčec, Yixuan Xu, Michael Aerni, Badr AlKhamissi, Ines Altemir Marinas, Mohammad Hossein Amani, Matin Ansaripour, Ilia Badanin, Harold Benoit, Emanuela Boros, Nicholas Browning, Fabian Bösch, Maximilian Böther, Niklas Canova, Camille Challier, Clement Charmillot, Jonathan Coles, Jan Deriu, Arnout Devos, Lukas Drescher, Daniil Dzenhaliou, Maud Ehrmann, Dongyang Fan, Simin Fan, Silin Gao, Miguel Gila, María Grandury, Diba Hashemi, Alexander Hoyle, Jiaming Jiang, Mark Klein, Andrei Kucharavy, Anastasiia Kucherenko, Fre

Abstract

We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
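
As a rough illustration of the memorization mitigation mentioned above, the following is a minimal sketch of a Goldfish-style loss mask. It is not the Apertus training code. It assumes the general idea that a pseudo-random subset of tokens is excluded from the next-token loss, with the decision derived from a hash of the preceding tokens so that a repeated passage is always masked the same way. The function names and the parameters `k` (drop roughly 1 in `k` tokens) and `context` (number of preceding tokens hashed) are hypothetical, and the values shown are not taken from the paper.

```python
import hashlib


def goldfish_mask(token_ids, k=4, context=3):
    """Return one boolean per token: True means the token counts toward the loss.

    Assumption: a token is dropped when a hash of the `context` preceding
    token ids is divisible by `k`, so about 1 in k tokens is excluded.
    """
    mask = []
    for i in range(len(token_ids)):
        if i < context:
            # Too little context to hash; keep the token in the loss.
            mask.append(True)
            continue
        window = token_ids[i - context:i]
        digest = hashlib.sha256(",".join(map(str, window)).encode()).digest()
        drop = int.from_bytes(digest[:8], "big") % k == 0
        mask.append(not drop)
    return mask


def masked_mean_loss(per_token_losses, mask):
    """Average the per-token losses over the kept positions only."""
    kept = [loss for loss, keep in zip(per_token_losses, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0
```

Because the decision depends only on the preceding tokens, the same passage seen in different documents has the same positions withheld, so the model never receives a training signal for those tokens and cannot reproduce the passage verbatim from the loss alone.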

 
