Abstract
Recent advances in depth-recurrent language models show that recurrence can decouple train-time compute and parameter count from test-time compute. In this work, we study how to convert existing pretrained non-recurrent language models into depth-recurrent models. We find that using a curriculum of recurrences to increase the effective depth of the model over the course of training preserves performance while reducing total computational cost. In our experiments on mathematics, we observe that converting pretrained models to recurrent ones results in better performance at a given compute budget than simply post-training the original non-recurrent language model.