Unplug and Play Language Models: Decomposing Experts in Language Models at Inference Time

  • 2025-08-21 14:04:54
  • Nakyeong Yang, Jiwon Moon, Junseok Kim, Yunah Jang, Kyomin Jung

Abstract

Enabled by large-scale text corpora with huge parameters, pre-trained language models operate as multi-task experts using a single model architecture. However, recent studies have revealed that certain neurons play disproportionately important roles in solving specific tasks, suggesting that task-relevant substructures can be isolated and selectively activated for each task. Therefore, we introduce Decomposition of Experts (DoE), a novel framework that dynamically identifies and activates task-specific experts within a language model to reduce inference cost without sacrificing accuracy. We first define a task expert as a set of parameters that significantly influence the performance of a specific task and propose a four-step unplug-and-play process: (1) receiving a user request, (2) identifying the corresponding task expert, (3) performing inference using the expert-localized model, and (4) restoring the original model and waiting for the next task. Using attribution methods and prompt tuning, DoE isolates task-relevant neurons, minimizing computational overhead while maintaining task performance. We assume a setting where a language model receives user requests from five widely used natural language understanding benchmarks, processing one task at a time. In this setup, we demonstrate that DoE achieves up to a 1.73x inference speed-up with a 65% pruning rate, without compromising accuracy. Comparisons with various task expert localization methods reveal that DoE effectively identifies task experts, while ablation studies validate the importance of its components. Additionally, we analyze the effects of batch size, token count, and layer types on inference speed-up, providing practical insights for adopting DoE. The proposed framework is both practical and scalable, applicable to any transformer-based architecture, offering a robust solution for efficient task-specific inference.
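
The four-step unplug-and-play process described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class TinyFFN, the functions localize_expert and serve, the task name "toy-task", and the attribution score (hidden activation times summed absolute outgoing weight) are all hypothetical stand-ins chosen for illustration, and prompt tuning is omitted entirely. The sketch only shows the control flow: score neurons on task data, keep the top 35% (a 65% pruning rate), run inference on the reduced layer, and leave the original weights intact for the next request.

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


class TinyFFN:
    """Toy feed-forward block: y = W2 · relu(W1 · x). Hypothetical stand-in for one layer."""

    def __init__(self, d_model, d_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
        self.w2 = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]

    def hidden(self, x, rows=None):
        # Compute activations only for the requested hidden neurons.
        rows = range(len(self.w1)) if rows is None else rows
        return [relu(sum(w * xi for w, xi in zip(self.w1[r], x))) for r in rows]

    def forward(self, x, keep=None):
        # keep=None runs the full layer; otherwise only the kept neurons are computed.
        rows = list(range(len(self.w1))) if keep is None else keep
        h = self.hidden(x, rows)
        return [sum(self.w2[o][r] * hv for r, hv in zip(rows, h)) for o in range(len(self.w2))]


def localize_expert(model, task_inputs, prune_rate):
    """Rank hidden neurons by an assumed attribution proxy and keep the top share."""
    n_hidden = len(model.w1)
    scores = [0.0] * n_hidden
    for x in task_inputs:
        for j, hj in enumerate(model.hidden(x)):
            # Assumed proxy: activation times total absolute outgoing weight.
            scores[j] += hj * sum(abs(row[j]) for row in model.w2)
    n_keep = max(1, round(n_hidden * (1.0 - prune_rate)))
    ranked = sorted(range(n_hidden), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])


def serve(model, registry, task, x):
    keep = registry[task]          # step 2: identify the task expert
    y = model.forward(x, keep)     # step 3: inference with the expert-localized layer
    return y                       # step 4: weights were never modified, so nothing to restore


rng = random.Random(1)
model = TinyFFN(d_model=4, d_hidden=20)
task_inputs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
registry = {"toy-task": localize_expert(model, task_inputs, prune_rate=0.65)}
request = task_inputs[0]           # step 1: receive a user request
output = serve(model, registry, "toy-task", request)
```

Because the expert is selected by indexing rather than by overwriting weights, the restore step is free in this sketch; a real system that physically slices weight matrices for speed would need to keep or reload the original parameters.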

 
