Abstract
While large language models (LLMs) have achieved remarkable performance across a wide range of tasks, their massive scale incurs prohibitive computational and memory costs for pre-training from scratch. Recent studies have investigated low-rank parameterization as a means of reducing model size and training cost. In this context, sparsity is often employed as a complementary technique to recover important information lost in low-rank compression by capturing salient features in the residual space. However, existing approaches typically combine low-rank and sparse components in a simplistic or ad hoc manner, often resulting in performance degradation compared to full-rank training. In this paper, we propose \textbf{LO}w-rank and \textbf{S}parse pre-\textbf{T}raining (\textbf{LOST}) for LLMs, a novel method that integrates low-rank and sparse structures to enable effective training of LLMs from scratch under strict efficiency constraints. LOST applies singular value decomposition to weight matrices, preserving the dominant low-rank components, while allocating the remaining singular values to construct channel-wise sparse components that complement the expressiveness of low-rank training. We evaluate LOST on LLM pre-training ranging from 60M to 7B parameters. Our experiments show that LOST achieves competitive or superior performance compared to full-rank models, while significantly reducing both memory and compute overhead. Code is available at \href{https://github.com/JiaxiLi1/LOST-Low-rank-and-Sparse-Training-for-Large-Language-Models}{LOSTRepo}.
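
To make the decomposition described above concrete, the following is a minimal NumPy sketch of one way to split a weight matrix with SVD into a low-rank part and a channel-wise sparse complement. It is an illustration under stated assumptions, not the released LOST implementation: the function names `lost_style_split` and `reconstruct`, the parameters `rank` and `keep_ratio`, the symmetric square-root scaling of the factors, and the choice of keeping the input channels (columns) with the largest residual norm are all hypothetical choices made for this example.

```python
import numpy as np

def lost_style_split(W, rank, keep_ratio):
    """Split W into low-rank factors plus a column-sparse residual (illustrative only)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)

    # Low-rank part: the top-`rank` singular triplets, with singular values
    # split evenly between the two factors (an assumed scaling).
    sqrt_s = np.sqrt(s[:rank])
    B = U[:, :rank] * sqrt_s            # shape (out_dim, rank)
    A = sqrt_s[:, None] * Vt[:rank]     # shape (rank, in_dim)

    # Residual rebuilt from the remaining singular values.
    R = (U[:, rank:] * s[rank:]) @ Vt[rank:]

    # Channel-wise sparsity (assumed criterion): keep whole columns of the
    # residual, chosen by largest L2 norm.
    k = max(1, int(round(keep_ratio * W.shape[1])))
    cols = np.sort(np.argsort(np.linalg.norm(R, axis=0))[-k:])
    S_cols = R[:, cols]                 # dense block, shape (out_dim, k)
    return B, A, cols, S_cols

def reconstruct(B, A, cols, S_cols):
    """Recombine the low-rank product with the kept sparse channels."""
    W_hat = B @ A
    W_hat[:, cols] += S_cols
    return W_hat

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
B, A, cols, S_cols = lost_style_split(W, rank=2, keep_ratio=0.5)
err_low_rank = np.linalg.norm(W - B @ A)
err_combined = np.linalg.norm(W - reconstruct(B, A, cols, S_cols))
print(err_combined <= err_low_rank)
```

Because the kept columns are copied exactly from the residual, the combined approximation error can never exceed that of the low-rank part alone in this sketch. Storing `S_cols` as a dense block alongside its column indices is one reason a channel-wise pattern is convenient: no unstructured sparse format is needed.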