Visualising Policy-Reward Interplay to Inform Zeroth-Order Preference Optimisation of Large Language Models

Abstract

Fine-tuning Large Language Models (LLMs) with first-order methods like back-propagation is computationally intensive. Zeroth-Order (ZO) optimisation uses function evaluations instead of gradients, reducing memory usage, but suffers from slow convergence in high-dimensional models. As a result, ZO research in LLMs has mostly focused on classification, overlooking more complex generative tasks. In this paper, we introduce ZOPrO, a novel ZO algorithm designed for Preference Optimisation in LLMs. We begin by analysing the interplay between policy and reward models during traditional (first-order) Preference Optimisation, uncovering patterns in their relative updates. Guided by these insights, we adapt Simultaneous Perturbation Stochastic Approximation (SPSA) with a targeted sampling strategy to accelerate convergence. Through experiments on summarisation, machine translation, and conversational assistants, we demonstrate that our method consistently enhances reward signals while achieving convergence times comparable to first-order methods. While it falls short of some state-of-the-art methods, our work is the first to apply Zeroth-Order methods to Preference Optimisation in LLMs, going beyond classification tasks and paving the way for a largely unexplored research direction. Code and visualisations are available at https://github.com/alessioGalatolo/VisZOPrO
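
For readers unfamiliar with SPSA, the following is a minimal sketch of the generic two-point SPSA gradient estimate that the abstract refers to, applied to a toy quadratic. It is not the ZOPrO algorithm and does not include the targeted sampling strategy. The names `spsa_gradient`, `spsa_minimise`, and `toy_loss`, as well as the step size, perturbation scale, and step count, are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np


def spsa_gradient(loss, theta, eps, rng):
    """Two-point SPSA estimate of the gradient of `loss` at `theta`."""
    # Rademacher perturbation: each coordinate is +1 or -1 with equal probability.
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    # Only two function evaluations are needed, with no back-propagation.
    diff = loss(theta + eps * delta) - loss(theta - eps * delta)
    # For +/-1 entries, dividing by delta is the same as multiplying by it.
    return diff / (2.0 * eps) * delta


def spsa_minimise(loss, theta0, lr=0.05, eps=1e-3, steps=2000, seed=0):
    """Plain gradient descent driven by the SPSA estimate (hypothetical settings)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(steps):
        theta -= lr * spsa_gradient(loss, theta, eps, rng)
    return theta


def toy_loss(theta):
    # Hypothetical stand-in objective whose minimum is the all-ones vector.
    return float(np.sum((theta - 1.0) ** 2))


theta_final = spsa_minimise(toy_loss, np.zeros(5))
print(toy_loss(theta_final))
```

Each step costs two loss evaluations regardless of the number of parameters, which is where the memory saving comes from. The estimate is noisy along a single random direction, which is consistent with the slow convergence in high dimensions noted above.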
