Abstract
The machine learning community has witnessed impressive advancements since the first appearance of large language models (LLMs), yet their huge memory consumption has become a major roadblock to large-scale training. Parameter-Efficient Fine-Tuning techniques such as Low-Rank Adaptation (LoRA) have been proposed to alleviate this problem, but their performance still fails to match full parameter training in most large-scale fine-tuning settings. To address this deficiency, we investigate the layerwise properties of LoRA on fine-tuning tasks and observe an uncommon skewness of weight norms across different layers. Building on this key observation, we discover a surprisingly simple training strategy that outperforms both LoRA and full parameter training in a wide range of settings, with memory costs as low as LoRA's. We name it Layerwise Importance Sampled AdamW (LISA), a promising alternative to LoRA, which applies the idea of importance sampling to different layers in LLMs and randomly freezes most middle layers during optimization. Experimental results show that, with similar or less GPU memory consumption, LISA surpasses LoRA or even full parameter tuning in downstream fine-tuning tasks, consistently outperforming LoRA by $11\%$-$37\%$ in terms of MT-Bench scores. On large models, specifically LLaMA-2-70B, LISA achieves on-par or better performance than LoRA on MT-Bench, GSM8K, and PubMedQA, demonstrating its effectiveness across different domains.
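
To make the idea of randomly freezing most middle layers concrete, the following is a minimal sketch of one possible layer-selection step, not the exact procedure used in this work. It assumes that the first and last layers (for example, the embedding and the output head) are always trained and that the remaining active layers are drawn uniformly at random; the function names, the uniform distribution, and the layer counts are illustrative assumptions.

```python
import random


def sample_active_layers(num_layers, num_active, rng):
    """Pick the layers that stay trainable for the next block of steps.

    Assumptions (illustrative only): layer 0 and layer num_layers - 1 are
    always trained, and num_active middle layers are sampled uniformly
    without replacement.
    """
    middle = list(range(1, num_layers - 1))
    sampled = rng.sample(middle, min(num_active, len(middle)))
    return sorted({0, num_layers - 1, *sampled})


def freeze_mask(num_layers, active):
    """Return a list where True means the layer is frozen (no updates)."""
    active_set = set(active)
    return [i not in active_set for i in range(num_layers)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
active = sample_active_layers(num_layers=34, num_active=2, rng=rng)
mask = freeze_mask(34, active)
print("trainable layers:", active)
print("frozen layers:", sum(mask), "of", len(mask))
```

In a training loop, such a mask would typically be resampled every few optimizer steps, so that only the currently active layers need gradients and optimizer state; this is where the memory saving would come from under these assumptions.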