Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains

Abstract

Large language models (LLMs) have achieved remarkable success across a wide range of tasks, yet their application in low-resource domains remains a significant challenge due to data scarcity and the high risk of overfitting. While in-domain data is limited, vast amounts of similar general-domain data exist, and our initial findings reveal that this data could serve as auxiliary supervision for domain enhancement. This observation leads us to our central research question: \textbf{\textit{how to effectively select the most valuable auxiliary data to maximize domain-specific performance}}, particularly when traditional methods are inapplicable due to a lack of large in-domain data pools or validation sets. To address this, we propose \textbf{NTK-Selector}, a principled and efficient framework that uses neural tangent kernels (NTK) to select general-domain auxiliary data for enhancing domain-specific performance. Our method tackles two challenges of directly applying NTK to LLMs, namely its theoretical assumptions and its prohibitive computational cost, by empirically demonstrating stable NTK-like behavior in LLMs during LoRA fine-tuning and by proposing a Jacobian-free approximation method. Extensive experiments across four low-resource domains (medical, financial, legal, and psychological) demonstrate that NTK-Selector consistently improves downstream performance. Specifically, fine-tuning on 1,000 in-domain samples alone yielded only +0.8 points for Llama3-8B-Instruct and +0.9 points for Qwen3-8B. In contrast, enriching with 9,000 auxiliary samples selected by NTK-Selector led to substantial \textbf{gains of +8.7 and +5.1 points}, corresponding to a \textbf{10.9x and 5.7x improvement} over the domain-only setting.
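
To make the idea of kernel-based selection concrete, the following is a minimal sketch of scoring auxiliary candidates with an empirical NTK, where the kernel between two inputs is the inner product of their parameter gradients. It is an illustration under stated assumptions, not the NTK-Selector procedure: the toy two-layer network, the mean-similarity scoring rule, and all function names (`param_gradient`, `empirical_ntk`, `select_auxiliary`) are hypothetical. It also computes gradients explicitly, whereas the abstract describes a Jacobian-free approximation whose details are not given here.

```python
import numpy as np

def param_gradient(x, W1, w2):
    """Flattened gradient of f(x) = w2 . tanh(W1 x) with respect to (W1, w2)."""
    h = np.tanh(W1 @ x)                         # hidden activations, shape (H,)
    grad_w2 = h                                 # df/dw2
    grad_W1 = np.outer(w2 * (1.0 - h ** 2), x)  # df/dW1, shape (H, D)
    return np.concatenate([grad_W1.ravel(), grad_w2])

def empirical_ntk(A, B, W1, w2):
    """Kernel matrix K[i, j] = <grad f(A[i]), grad f(B[j])>."""
    GA = np.stack([param_gradient(a, W1, w2) for a in A])
    GB = np.stack([param_gradient(b, W1, w2) for b in B])
    return GA @ GB.T

def select_auxiliary(candidates, in_domain, W1, w2, k):
    """Assumed scoring rule: mean kernel similarity to the in-domain set; keep the top k."""
    scores = empirical_ntk(candidates, in_domain, W1, w2).mean(axis=1)
    return np.argsort(-scores)[:k], scores

# Toy data standing in for in-domain and general-domain samples (hypothetical).
rng = np.random.default_rng(0)
D, H = 8, 16
W1 = rng.normal(size=(H, D)) / np.sqrt(D)
w2 = rng.normal(size=H) / np.sqrt(H)
in_domain = rng.normal(size=(5, D))
candidates = rng.normal(size=(50, D))

chosen, scores = select_auxiliary(candidates, in_domain, W1, w2, k=10)
print("selected candidate indices:", chosen)
```

Materializing one gradient vector per example, as this sketch does, scales with the number of parameters, which is presumably the cost that motivates a Jacobian-free approximation at LLM scale.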
