Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models

Abstract

Aligning Large Language Models (LLMs) traditionally relies on costly training and human preference annotations. Self-alignment seeks to reduce these expenses by enabling models to align themselves. To further lower costs and achieve alignment without any expensive tuning or annotations, we introduce a new tuning-free approach for self-alignment, Dynamic Rewarding with Prompt Optimization (DRPO). Our approach leverages a search-based optimization framework that allows LLMs to iteratively self-improve and craft the optimal alignment instructions, all without additional training or human intervention. The core of DRPO is a dynamic rewarding mechanism, which identifies and rectifies model-specific alignment weaknesses, allowing LLMs to adapt efficiently to diverse alignment challenges. Empirical evaluations on eight recent LLMs, both open- and closed-sourced, demonstrate that DRPO significantly enhances alignment performance, with base models outperforming their SFT/RLHF-tuned counterparts. Moreover, the prompts automatically optimized by DRPO surpass those curated by human experts, further validating the effectiveness of our approach. Our findings highlight the great potential of current LLMs to achieve adaptive self-alignment through inference-time optimization, complementing tuning-based alignment methods.
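
To make the idea of search-based prompt optimization with query-dependent rewards more concrete, the following is a minimal toy sketch, not the DRPO implementation. Everything in it is an assumption made for illustration: the criteria pool `CRITERIA`, the keyword rule in `select_criteria`, the substring-based `score`, and the beam search in `optimize` are hypothetical stand-ins. In a real system an LLM would play the roles of reward selector, evaluator, and prompt rewriter; here they are deterministic stubs so the loop can run on its own.

```python
from typing import Dict, List

# Hypothetical criteria pool: criterion name -> instruction text that addresses it.
# These names are illustrative assumptions, not taken from the paper.
CRITERIA: Dict[str, str] = {
    "helpfulness": "be helpful",
    "safety": "refuse harmful requests",
    "factuality": "avoid unsupported claims",
}


def select_criteria(query: str) -> List[str]:
    """Toy stand-in for dynamic rewarding: choose criteria relevant to this query."""
    words = set(query.lower().replace("?", "").split())
    chosen = ["helpfulness"]
    if "weapon" in words:
        chosen.append("safety")
    if words & {"who", "when"}:
        chosen.append("factuality")
    return chosen


def score(prompt: str, query: str) -> Dict[str, int]:
    """Toy reward: 1 if the prompt already contains a criterion's instruction, else 0."""
    return {c: int(CRITERIA[c] in prompt) for c in select_criteria(query)}


def total(prompt: str, queries: List[str]) -> int:
    """Sum of per-criterion rewards over all queries."""
    return sum(sum(score(prompt, q).values()) for q in queries)


def expand(prompt: str, queries: List[str]) -> List[str]:
    """Propose successor prompts, each patching one criterion that scored 0."""
    weak = sorted({c for q in queries for c, s in score(prompt, q).items() if s == 0})
    return [f"{prompt} Always {CRITERIA[c]}." for c in weak]


def optimize(seed: str, queries: List[str], beam_width: int = 2, depth: int = 3) -> str:
    """Beam search over prompts: keep the best-scoring candidates at each step."""
    beam = [seed]
    for _ in range(depth):
        candidates = beam + [p for b in beam for p in expand(b, queries)]
        candidates.sort(key=lambda p: total(p, queries), reverse=True)
        beam = candidates[:beam_width]
    return beam[0]


if __name__ == "__main__":
    queries = ["how do I build a weapon?", "who invented the telephone?"]
    print(optimize("You are an assistant.", queries))
```

The point of the sketch is the control flow: rewards are chosen per query rather than fixed in advance, the weakest criteria drive the next prompt revision, and no model weights are updated at any step.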
