Abstract
Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat benchmarks and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models. Overall, we show how, with limited computation resources, we can remove many of the original attention layers and generate from the resulting model more efficiently. Our top-performing model, distilled from Llama3-8B-Instruct, achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench, surpassing the best instruction-tuned linear RNN model.
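To give some intuition for why attention projection weights can be reused in a linear RNN, the following is a minimal toy sketch, not the distillation procedure itself, which the abstract does not detail. It relies on a standard observation: causal attention with the softmax removed can be computed either in the usual quadratic form or as a recurrence over a fixed-size state, using the same query, key, and value projections. The function names, the matrices `Wq`, `Wk`, `Wv`, and the shapes are illustrative assumptions.

```python
import numpy as np

def causal_linear_attention(x, Wq, Wk, Wv):
    """Softmax-free causal attention in its quadratic (parallel) form."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = np.tril(q @ k.T)  # causal mask: position t only sees positions <= t
    return scores @ v

def linear_rnn(x, Wq, Wk, Wv):
    """The same computation as a recurrence with a fixed-size state."""
    d_k, d_v = Wk.shape[1], Wv.shape[1]
    state = np.zeros((d_k, d_v))  # running sum of outer(k_s, v_s) for s <= t
    outputs = []
    for x_t in x:
        q_t, k_t, v_t = x_t @ Wq, x_t @ Wk, x_t @ Wv
        state = state + np.outer(k_t, v_t)  # state update, independent of sequence length
        outputs.append(q_t @ state)         # readout with the query projection
    return np.stack(outputs)

# Toy check with random (hypothetical) weights: both forms agree.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))   # 6 tokens, model width 8
Wq = rng.standard_normal((8, 4))
Wk = rng.standard_normal((8, 4))
Wv = rng.standard_normal((8, 4))
parallel_out = causal_linear_attention(x, Wq, Wk, Wv)
recurrent_out = linear_rnn(x, Wq, Wk, Wv)
```

In the recurrent form, the per-token cost and memory do not grow with sequence length, which is the deployment advantage the abstract refers to. Real softmax attention is not exactly equal to this recurrence, so the reused weights would serve as an initialization that distillation then refines.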