UniAPL: A Unified Adversarial Preference Learning Framework for Instruct-Following

  • 2025-09-29 17:53:09
  • FaQiang Qian, WeiKun Zhang, Ziliang Wang, Kang An, Xuhui Zheng, Liangjian Wen, Mengya Gao, Yong Dai, Yichao Wu

Abstract

Shaping powerful LLMs to be beneficial and safe is central to AI alignment. We argue that post-training alignment is fundamentally a unified Preference Learning problem, involving two modalities: demonstrated preferences (e.g., Supervised Fine-Tuning, SFT) and comparative preferences (e.g., Reinforcement Learning, RL). The standard sequential pipeline, SFT followed by RL, is flawed due to a critical distributional mismatch: SFT uses static expert data, but as the policy evolves, its generation distribution drifts, making SFT knowledge brittle. Subsequent RL then explores without direct access to the rich, ground-truth knowledge in expert demonstrations, leading to inefficient, ungrounded updates. This separation prevents mutual regularization between data sources. To address this, we reframe alignment as a constrained optimization problem and propose Unified Adversarial Preference Learning (UniAPL), a novel framework that dynamically aligns the policy's distribution with the expert's. UniAPL implements a single-stage unified training objective, jointly learning from mixed batches of SFT and preference data. In every gradient step, dense expert demonstrations directly ground and regularize online exploration, inherently resolving distributional mismatch and maximizing data synergy. We evaluate UniAPL on instruction-following tasks using Qwen3-235B-Instruct-2507 as the teacher. Our models match or exceed strong GRPO baselines: +5.77% on Qwen3-0.6B (matching a 32B model) and +3.75% on Qwen3-4B, even outperforming the teacher. Analyses of response length and log-probability distributions confirm that UniAPL outputs closely mimic expert demonstrations, achieving both stronger performance and better behavioral alignment.
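To make the idea of a single-stage objective over mixed batches more concrete, the following is a minimal sketch and not the paper's actual loss. It assumes the combined objective can be approximated as a GRPO-style policy-gradient term on sampled responses plus a weighted SFT negative log-likelihood term on expert demonstrations. The adversarial component is omitted because the abstract does not specify it, and all function names and the `sft_weight` parameter are hypothetical.

```python
import math


def sft_loss(expert_token_logprobs):
    """Mean negative log-likelihood of expert (demonstration) tokens."""
    return -sum(expert_token_logprobs) / len(expert_token_logprobs)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def policy_gradient_loss(sampled_seq_logprobs, rewards):
    """REINFORCE-style surrogate: -mean(advantage * log-prob of the response)."""
    advantages = group_relative_advantages(rewards)
    weighted = [a * lp for a, lp in zip(advantages, sampled_seq_logprobs)]
    return -sum(weighted) / len(weighted)


def unified_loss(expert_token_logprobs, sampled_seq_logprobs, rewards,
                 sft_weight=1.0):
    """Assumed combined objective for one mixed batch (hypothetical weighting).

    Both terms are computed in the same step, so the expert data
    regularizes the update driven by the sampled responses.
    """
    rl_term = policy_gradient_loss(sampled_seq_logprobs, rewards)
    sft_term = sft_loss(expert_token_logprobs)
    return rl_term + sft_weight * sft_term


# Toy mixed batch: two expert tokens, and a group of two sampled responses.
loss = unified_loss(
    expert_token_logprobs=[-0.5, -1.5],
    sampled_seq_logprobs=[-2.0, -4.0],
    rewards=[1.0, 0.0],
)
print(loss)
```

In a real training loop the log-probabilities would come from the policy model and the loss would be differentiated with an autograd framework; plain floats are used here only to show how the two terms share one step.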

 
