Abstract
The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results at https://github.com/EIT-NLP/LLaSO.