Abstract
We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training at a context length of 8192 incurs 16x the computational cost in self-attention layers compared with a context length of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be done effectively and efficiently with sparse local attention. The proposed shift short attention effectively enables context extension, leading to non-trivial computation savings with performance similar to fine-tuning with vanilla attention. In particular, it can be implemented with only two lines of code in training, while being optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA demonstrates strong empirical results on various tasks on LLaMA2 models from 7B/13B to 70B. LongLoRA extends LLaMA2 7B from 4k context to 100k, or LLaMA2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, like FlashAttention-2. In addition, to make LongLoRA practical, we collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long context question-answer pairs.
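The 16x figure follows from the quadratic growth of self-attention cost with sequence length. The sketch below reproduces that arithmetic and, as an illustration only, estimates the saving from restricting attention to local groups. The function names, the non-overlapping grouping, and the group size of 2048 are assumptions made for this example; the abstract does not specify how the local attention is partitioned.

```python
def attention_cost_ratio(long_len: int, short_len: int) -> float:
    """Ratio of self-attention pair computations, assuming cost grows as n^2."""
    return (long_len / short_len) ** 2


def local_attention_cost_ratio(seq_len: int, group_size: int) -> float:
    """Cost of attention restricted to non-overlapping groups, relative to dense.

    Dense attention scores seq_len^2 token pairs. Grouped attention scores
    (seq_len / group_size) groups of group_size^2 pairs each.
    The grouping scheme is a hypothetical stand-in for sparse local attention.
    """
    if seq_len % group_size != 0:
        raise ValueError("seq_len must be divisible by group_size")
    dense = seq_len ** 2
    grouped = (seq_len // group_size) * group_size ** 2
    return grouped / dense


# Dense attention at 8192 tokens versus 2048 tokens.
print(attention_cost_ratio(8192, 2048))        # 16.0

# Assumed group size of 2048: local attention at 8192 tokens costs a
# quarter of dense attention at the same length.
print(local_attention_cost_ratio(8192, 2048))  # 0.25
```

Under these assumptions the grouped cost grows linearly in the sequence length for a fixed group size, which is why local attention during fine-tuning can reduce computation substantially.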