Abstract
Fine-tuning large language models (LLMs) often faces GPU memory bottlenecks: the backward pass of first-order optimizers like Adam increases memory usage to more than 10 times the inference level (e.g., 633 GB for OPT-30B). Zeroth-order (ZO) optimizers avoid this cost by estimating gradients only from forward passes, yet existing methods like MeZO usually require many more steps to converge. Can this trade-off between speed and memory in ZO be fundamentally improved? Normalized-SGD demonstrates strong empirical performance with greater memory efficiency than Adam. In light of this, we introduce FZOO, a Fast Zeroth-Order Optimizer toward Adam-Scale Speed. FZOO reduces the total forward passes needed for convergence by employing batched one-sided estimates that adapt step sizes based on the standard deviation of batch losses. It also accelerates per-batch computation through the use of Rademacher random vector perturbations coupled with CUDA's parallel processing. Extensive experiments on diverse models, including RoBERTa-large, OPT (350M-66B), Phi-2, and Llama3, across 11 tasks validate FZOO's effectiveness. On average, FZOO outperforms MeZO by 3 percent in accuracy while requiring 3 times fewer forward passes. For RoBERTa-large, FZOO achieves average improvements of 5.6 percent in accuracy and an 18 times reduction in forward passes compared to MeZO, achieving convergence speeds comparable to Adam. We also provide theoretical analysis proving FZOO's formal equivalence to a normalized-SGD update rule and its convergence guarantees. FZOO integrates smoothly into PEFT techniques, enabling even larger memory savings. Overall, our results make single-GPU, high-speed, full-parameter fine-tuning practical and point toward future work on memory-efficient pre-training.
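
The ingredients named above (batched one-sided estimates, Rademacher perturbations, and a step size scaled by the standard deviation of the batch losses) can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. The exact update rule, the function name `zo_normalized_step`, the toy objective `quadratic_loss`, and the parameters `n_perturb`, `eps`, and `lr` are all assumptions made for illustration. The sketch also evaluates the perturbations in a plain Python loop rather than in parallel on a GPU.

```python
import numpy as np


def zo_normalized_step(loss_fn, theta, rng, n_perturb=8, eps=1e-3, lr=0.05):
    """One illustrative forward-only update (assumed form, not the paper's exact rule)."""
    base_loss = loss_fn(theta)  # single unperturbed forward pass
    # Rademacher directions: every entry is +1 or -1 with equal probability.
    u = rng.choice(np.array([-1.0, 1.0]), size=(n_perturb, theta.size))
    # One-sided evaluations: only theta + eps * u_i is evaluated, never theta - eps * u_i.
    losses = np.array([loss_fn(theta + eps * u_i) for u_i in u])
    # Average of (loss difference) * direction over the batch of perturbations.
    direction = ((losses - base_loss)[:, None] * u).mean(axis=0)
    # Assumption: the standard deviation of the perturbed losses sets the step scale.
    sigma = max(losses.std(), 1e-12)
    return theta - lr * direction / sigma


def quadratic_loss(theta):
    """Hypothetical toy objective standing in for a model's forward pass."""
    return 0.5 * float(theta @ theta)


rng = np.random.default_rng(0)
theta = np.linspace(1.0, 2.0, 10)
initial_loss = quadratic_loss(theta)
for _ in range(300):
    theta = zo_normalized_step(quadratic_loss, theta, rng)
final_loss = quadratic_loss(theta)
```

Each call costs `n_perturb + 1` forward passes and no backward pass. For small `eps`, each loss difference is approximately `eps` times the inner product of the perturbation with the gradient, so the standard deviation of the perturbed losses tracks `eps` times the gradient norm. Dividing by it therefore makes the step behave roughly like a normalized-SGD step under these assumptions, which is the intuition behind the equivalence claimed above.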