Federated Instruction Tuning of LLMs with Domain Coverage Augmentation

Abstract

Federated Domain-specific Instruction Tuning (FedDIT) utilizes limited cross-client private data alongside server-side public data for instruction augmentation, ultimately enhancing model performance within specific domains. Yet the factors affecting FedDIT remain unclear, and existing instruction augmentation methods mainly focus on the centralized setting without considering the distributed environment. Our experiments reveal that cross-client domain coverage, rather than data heterogeneity, drives model performance in FedDIT. In response, we propose FedDCA, which optimizes domain coverage through greedy client center selection and retrieval-based augmentation. To alleviate client-side computational burdens, FedDCA$^*$ uses heterogeneous encoders with server-side feature alignment. Extensive experiments across four distinct domains (code, medical, financial, and mathematical) substantiate the effectiveness of both methods. Additionally, we investigate privacy preservation against memory extraction attacks utilizing varying amounts of public data. Results show no significant correlation between the volume of public data and the privacy-preserving capability. However, as the number of fine-tuning rounds increases, the risk of privacy leakage decreases or converges.
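
To make "greedy client center selection and retrieval-based augmentation" more concrete, the following is a minimal sketch of one plausible reading, not the FedDCA algorithm itself. It assumes that clients upload embedding-space cluster centers, that the server picks a spread-out subset with greedy farthest-point (k-center) selection, and that public instructions are retrieved by cosine similarity. The function names, the choice of the starting center, and the top-k retrieval rule are all hypothetical; the abstract does not specify them.

```python
import numpy as np


def greedy_center_selection(candidates, k):
    """Pick up to k well-spread rows of `candidates` (assumed client centers).

    Uses greedy farthest-point selection as a stand-in for the paper's
    greedy step; this is an assumption, not the authors' stated objective.
    Returns the selected row indices in the order they were chosen.
    """
    candidates = np.asarray(candidates, dtype=float)
    k = min(k, len(candidates))
    # Assumption: start from the candidate closest to the overall mean.
    mean = candidates.mean(axis=0)
    first = int(np.argmin(np.linalg.norm(candidates - mean, axis=1)))
    selected = [first]
    # Distance from every candidate to its nearest selected center.
    min_dist = np.linalg.norm(candidates - candidates[first], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))  # least-covered candidate so far
        selected.append(nxt)
        dist_to_new = np.linalg.norm(candidates - candidates[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected


def retrieve_public(centers, public_emb, top_k):
    """Return sorted indices of public embeddings nearest to any center.

    Similarity is cosine; all vectors are assumed to be nonzero.
    """
    centers = np.asarray(centers, dtype=float)
    public_emb = np.asarray(public_emb, dtype=float)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    p = public_emb / np.linalg.norm(public_emb, axis=1, keepdims=True)
    sims = c @ p.T  # shape: (num_centers, num_public)
    top = np.argsort(-sims, axis=1)[:, :top_k]
    return sorted(set(top.ravel().tolist()))
```

In this sketch the server would call `greedy_center_selection` on the pooled client centers and pass the chosen rows to `retrieve_public` to assemble the augmentation set. Farthest-point selection is used only because it is a simple, well-known way to spread centers across the embedding space; the paper's actual coverage objective may differ.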
