ATM: Improving Model Merging by Alternating Tuning and Merging

Abstract

Model merging has recently emerged as a cost-efficient paradigm for multi-task learning. Among current approaches, task arithmetic stands out for its simplicity and effectiveness. In this paper, we motivate the effectiveness of task vectors by linking them to multi-task gradients. We show that in a single-epoch scenario, task vectors are mathematically equivalent to the gradients obtained via gradient descent in a multi-task setting, and still approximate these gradients in subsequent epochs. Furthermore, we show that task vectors perform optimally when equality is maintained, and their effectiveness is largely driven by the first epoch's gradient. Building on this insight, we propose viewing model merging as a single step in an iterative process that Alternates between Tuning and Merging (ATM). This method acts as a bridge between model merging and multi-task gradient descent, achieving state-of-the-art results with the same data and computational requirements. We extensively evaluate ATM across diverse settings, achieving up to 20% higher accuracy in computer vision and NLP tasks, compared to the best baselines. Finally, we provide both empirical and theoretical support for its effectiveness, demonstrating increased orthogonality between task vectors and proving that ATM minimizes an upper bound on the loss obtained by jointly finetuning all tasks.
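
The link between task vectors and multi-task gradients can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes toy least-squares tasks, full-batch gradient descent (so one epoch is a single gradient step), and a merging coefficient `alpha`. All names here (`make_task`, `finetune`, `merge`, `atm`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def make_task(seed, dim=5, n=20):
    # Hypothetical toy task: a random least-squares regression problem.
    rng = np.random.default_rng(seed)
    return rng.normal(size=(n, dim)), rng.normal(size=n)

def loss(theta, task):
    X, y = task
    return 0.5 * np.mean((X @ theta - y) ** 2)

def grad(theta, task):
    X, y = task
    return X.T @ (X @ theta - y) / len(y)

def finetune(theta, task, lr, steps):
    # Full-batch gradient descent on one task, starting from theta.
    theta = theta.copy()
    for _ in range(steps):
        theta = theta - lr * grad(theta, task)
    return theta

def merge(theta_base, finetuned, alpha):
    # Task arithmetic: add the scaled sum of task vectors to the base model.
    task_vectors = [t - theta_base for t in finetuned]
    return theta_base + alpha * np.sum(task_vectors, axis=0)

def atm(theta0, tasks, lr, steps, alpha, iterations):
    # Alternate between tuning each task from the current model and merging.
    theta = theta0.copy()
    for _ in range(iterations):
        finetuned = [finetune(theta, task, lr, steps) for task in tasks]
        theta = merge(theta, finetuned, alpha)
    return theta

def total_loss(theta):
    return sum(loss(theta, t) for t in tasks)

tasks = [make_task(seed) for seed in range(3)]
theta0 = np.zeros(5)
lr, alpha = 0.05, 1.0 / len(tasks)

# With one gradient step per task, merging the task vectors coincides with
# one gradient descent step on the summed loss (step size alpha * lr).
merged = merge(theta0, [finetune(theta0, t, lr, 1) for t in tasks], alpha)
mtl_step = theta0 - alpha * lr * sum(grad(theta0, t) for t in tasks)
one_step_gap = float(np.max(np.abs(merged - mtl_step)))
print("max gap between merge and multi-task step:", one_step_gap)

# Same tuning budget (40 steps per task), spent in one shot or alternated.
one_shot = merge(theta0, [finetune(theta0, t, lr, 40) for t in tasks], alpha)
alternating = atm(theta0, tasks, lr, steps=4, alpha=alpha, iterations=10)
print("one-shot merge loss:", total_loss(one_shot))
print("alternating loss:   ", total_loss(alternating))
```

In this sketch, setting `steps=1` makes each tune-and-merge iteration identical to a multi-task gradient step, while larger `steps` values move toward conventional one-shot merging. The toy losses are only meant to show the mechanics and say nothing about the accuracy figures reported above.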
