Scaling Efficient LLMs

  • 2025-09-22 13:37:52
  • B. N. Kausik

Abstract

Trained LLMs in the transformer architecture are typically sparse in that most of the parameters are negligible, raising questions on efficiency. Furthermore, the so-called "AI scaling law" for transformers suggests that the number of parameters must scale linearly with the size of the data. In response, we inquire into efficient LLMs, i.e., those with the fewest parameters that achieve the desired accuracy on a training corpus. Specifically, by comparing theoretical and empirical estimates of the Kullback-Leibler divergence, we derive a natural AI scaling law that the number of parameters in an efficient LLM scales as $D^{\gamma}$, where $D$ is the size of the training data and $\gamma \in [0.44, 0.72]$, suggesting the existence of more efficient architectures. Against this backdrop, we propose recurrent transformers, combining the efficacy of transformers with the efficiency of recurrent networks, progressively applying a single transformer layer to a fixed-width sliding window across the input sequence. Recurrent transformers (a) run in linear time in the sequence length, (b) are memory-efficient and amenable to parallel processing in large batches, (c) learn to forget history for language tasks, or accumulate history for long-range tasks like copy and selective copy, and (d) are amenable to curriculum training to overcome vanishing gradients. In our experiments, we find that recurrent transformers perform favorably on benchmark tests.
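
To make the gap between the two scaling laws concrete, the short sketch below compares how the parameter count would grow under the derived law (parameters proportional to $D^{\gamma}$ with $\gamma \in [0.44, 0.72]$) and under a linear law ($\gamma = 1$) when the training data grows by a given factor. It is an illustrative calculation only: the function name `param_growth_factor` and the chosen growth factors are assumptions made for this example, not taken from the paper, and the proportionality constant is left out because it cancels in the ratio.

```python
def param_growth_factor(data_growth: float, gamma: float) -> float:
    """Factor by which the parameter count grows when the data grows by
    `data_growth`, assuming parameters are proportional to D**gamma.
    (Hypothetical helper for illustration; the constant of proportionality
    cancels in the ratio.)"""
    return data_growth ** gamma


# Exponent range stated in the abstract.
GAMMA_LOW, GAMMA_HIGH = 0.44, 0.72

for growth in (10.0, 100.0):
    low = param_growth_factor(growth, GAMMA_LOW)
    high = param_growth_factor(growth, GAMMA_HIGH)
    linear = param_growth_factor(growth, 1.0)  # linear law for comparison
    print(
        f"data x{growth:g}: parameters x{low:.2f} to x{high:.2f} "
        f"(linear law: x{linear:g})"
    )
```

Under these assumptions, a tenfold increase in data calls for well under a tenfold increase in parameters, which is the sense in which the abstract speaks of more efficient architectures.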

 
