Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy

Abstract

A prevalent approach in Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViT) involves freezing the majority of the backbone parameters and solely learning low-rank adaptation weight matrices to accommodate downstream tasks. These low-rank matrices are commonly derived through the multiplication structure of down-projection and up-projection matrices, exemplified by methods such as LoRA and Adapter. In this work, we observe an approximate orthogonality among any two row or column vectors within any weight matrix of the backbone parameters; however, this property is absent in the vectors of the down/up-projection matrices. Approximate orthogonality implies a reduction in the upper bound of the model's generalization error, signifying that the model possesses enhanced generalization capability. If the fine-tuned down/up-projection matrices were to exhibit this same property as the pre-trained backbone matrices, could the generalization capability of fine-tuned ViTs be further augmented? To address this question, we propose an Approximately Orthogonal Fine-Tuning (AOFT) strategy for representing the low-rank weight matrices. This strategy employs a single learnable vector to generate a set of approximately orthogonal vectors, which form the down/up-projection matrices, thereby aligning the properties of these matrices with those of the backbone. Extensive experimental results demonstrate that our method achieves competitive performance across a range of downstream image classification tasks, confirming the efficacy of the enhanced generalization capability embedded in the down/up-projection matrices.
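
The notion of approximate orthogonality among row vectors can be made concrete with a simple diagnostic. The sketch below is illustrative only and is not taken from the paper: it assumes that the mean absolute cosine similarity over all distinct row pairs is a reasonable proxy for the property described above, and the helper names (cosine, mean_abs_offdiag_cosine) and the toy matrices are hypothetical. It does not implement AOFT itself, whose construction from a single learnable vector is not specified in the abstract.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_abs_offdiag_cosine(rows):
    """Average |cosine| over all distinct pairs of row vectors.

    Values near 0 indicate approximately orthogonal rows;
    values near 1 indicate nearly parallel rows.
    (Assumed diagnostic, not the paper's own metric.)
    """
    pairs = list(combinations(rows, 2))
    return sum(abs(cosine(u, v)) for u, v in pairs) / len(pairs)


# Toy matrices (hypothetical data, chosen only to show the two extremes).
orthogonal_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
correlated_rows = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [1.0, 1.0, 0.1]]

print(mean_abs_offdiag_cosine(orthogonal_rows))
print(mean_abs_offdiag_cosine(correlated_rows))
```

Applied to the rows (or, after transposing, the columns) of a backbone weight matrix and of a learned down/up-projection matrix, a diagnostic of this kind would give one way to compare how close each is to orthogonality.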
