LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning

Abstract

Fine-tuning large pre-trained models on downstream tasks has been adopted in a variety of domains recently. However, it is costly to update the entire parameter set of large pre-trained models. Although recently proposed parameter-efficient transfer learning (PETL) techniques allow updating a small subset of parameters (e.g., only 2% of parameters) inside a pre-trained backbone network for a new task, they only reduce the training memory requirement by up to 30%. This is because the gradient computation for the trainable parameters still requires backpropagation through the large pre-trained backbone model. To address this, we propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements by more substantial amounts. Unlike existing parameter-efficient methods that insert additional parameters inside backbone networks, we train a ladder side network, a small and separate network that takes intermediate activations as input via shortcut connections (ladders) from backbone networks and makes predictions. LST has significantly lower memory requirements than previous methods, because it does not require backpropagation through the backbone network, but instead only through the side network and ladder connections. We evaluate our method with various models (T5, CLIP-T5) on both NLP (GLUE) and vision-language (VQA, GQA, NLVR2, MSCOCO) tasks. LST saves 69% of the memory cost of fine-tuning the whole network, while other methods save only 26% at similar parameter usage (hence, 2.7x more memory savings). Moreover, LST achieves higher accuracy than Adapter and LoRA in a low-memory regime. To further show the advantage of this better memory efficiency, we also apply LST to larger T5 models (T5-large, T5-3B), attaining better GLUE performance than full fine-tuning and other PETL methods. The same trend also holds in our experiments on VL tasks.
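
The memory argument above can be made concrete with a toy sketch in plain Python. It is not the architecture evaluated in the paper: the abstract does not specify how ladder inputs are fused, so the additive fusion, the tanh nonlinearity, the dense layers standing in for transformer blocks, the layer sizes, and every name here (`backbone_forward`, `side_forward`, `side_backward`, `train_step`) are illustrative assumptions. What it is meant to show is that the backward pass only needs the backbone activations as constants, so no gradient is ever computed for, or propagated through, the backbone weights.

```python
import math

D_BACKBONE, D_SIDE, N_LAYERS = 8, 2, 3  # hypothetical toy sizes

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def make_matrix(rows, cols, scale, seed):
    """Deterministic small weights so the sketch is reproducible."""
    return [[scale * math.sin(seed + r * cols + c) for c in range(cols)]
            for r in range(rows)]

# Frozen backbone: it is only ever run forward.
backbone = [make_matrix(D_BACKBONE, D_BACKBONE, 0.3, i) for i in range(N_LAYERS)]

# Trainable side network. "down" holds the ladder projections that map a
# backbone activation into the smaller side dimension (an assumed design).
side = {
    "down": [make_matrix(D_SIDE, D_BACKBONE, 0.3, 10 + i) for i in range(N_LAYERS)],
    "W": [make_matrix(D_SIDE, D_SIDE, 0.3, 20 + i) for i in range(N_LAYERS)],
    "head": [0.5, -0.5],
}

def backbone_forward(x):
    """Forward-only pass; returns the activation of every backbone layer."""
    acts, h = [], x
    for W in backbone:
        h = [math.tanh(v) for v in matvec(W, h)]
        acts.append(h)
    return acts

def side_forward(params, acts):
    """Each side layer adds its own state to a down-projected backbone activation."""
    s = [0.0] * D_SIDE
    states = [s]
    for W, down, h in zip(params["W"], params["down"], acts):
        z = [a + b for a, b in zip(matvec(W, s), matvec(down, h))]
        s = [math.tanh(v) for v in z]
        states.append(s)
    y = sum(w * v for w, v in zip(params["head"], s))
    return y, states

def side_backward(params, acts, states, dy):
    """Backpropagate through the side network and ladders only.

    acts are treated as constants: no gradient w.r.t. them is computed,
    so nothing flows back into the backbone.
    """
    grads = {"W": [None] * N_LAYERS, "down": [None] * N_LAYERS,
             "head": [dy * v for v in states[-1]]}
    ds = [dy * w for w in params["head"]]
    for i in reversed(range(N_LAYERS)):
        dz = [g * (1.0 - v * v) for g, v in zip(ds, states[i + 1])]  # tanh'
        grads["W"][i] = [[g * v for v in states[i]] for g in dz]
        grads["down"][i] = [[g * v for v in acts[i]] for g in dz]
        ds = [sum(params["W"][i][r][c] * dz[r] for r in range(D_SIDE))
              for c in range(D_SIDE)]
    return grads

def train_step(params, x, target, lr=0.05):
    """One SGD step on the loss 0.5 * (y - target)^2; returns the pre-update loss."""
    acts = backbone_forward(x)
    y, states = side_forward(params, acts)
    grads = side_backward(params, acts, states, y - target)
    for name in ("W", "down"):
        for i in range(N_LAYERS):
            for r, row in enumerate(grads[name][i]):
                for c, g in enumerate(row):
                    params[name][i][r][c] -= lr * g
    params["head"] = [w - lr * g for w, g in zip(params["head"], grads["head"])]
    return 0.5 * (y - target) ** 2

def count(matrices):
    return sum(len(row) for M in matrices for row in M)

n_backbone = count(backbone)
n_side = count(side["W"]) + count(side["down"]) + len(side["head"])

# Usage: train on one made-up example and confirm the backbone is untouched.
x, target = [0.5, -1.0, 0.25, 0.0, 1.0, -0.5, 0.75, -0.25], 1.0
frozen_copy = [[row[:] for row in W] for W in backbone]
losses = [train_step(side, x, target) for _ in range(20)]
assert backbone == frozen_copy
```

In this toy the only state kept for the backward pass is the small side-network states plus the backbone outputs that feed the ladders. In a real backbone, many more internal tensors would have to be stored if gradients had to flow through it, which is the memory cost the abstract attributes to methods that insert parameters inside the backbone.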
