Abstract
The tension between data privacy and model utility has become the defining bottleneck for the practical deployment of large language models (LLMs) trained on sensitive corpora, such as healthcare data. Differentially private stochastic gradient descent (DP-SGD) guarantees formal privacy, yet it does so at a pronounced cost: gradients are forcibly clipped and perturbed with noise, degrading sample efficiency and final accuracy. Numerous variants have been proposed to soften this trade-off, but they all share a handicap: their control knobs are hard-coded, global, and oblivious to the evolving optimization landscape. Consequently, practitioners are forced either to over-spend privacy budget in pursuit of utility, or to accept mediocre models in order to stay within privacy constraints. We present RLDP, the first framework to cast DP optimization itself as a closed-loop control problem amenable to modern deep reinforcement learning (RL). RLDP continuously senses rich statistics of the learning dynamics and acts by selecting fine-grained per-parameter gradient-clipping thresholds as well as the magnitude of injected Gaussian noise. A soft actor-critic (SAC) hyper-policy is trained online during language model fine-tuning; it learns, from scratch, how to allocate the privacy budget where it matters and when it matters. Across more than 1,600 ablation experiments on GPT2-small, Llama-1B, Llama-3B, and Mistral-7B, RLDP delivers perplexity reductions of 1.3-30.5% (mean 5.4%) and an average 5.6% downstream utility gain. RLDP reaches each baseline's final utility after only 13-43% of the gradient-update budget (mean speed-up 71%), all while honoring the same ($\epsilon$, $\delta$)-DP contract and exhibiting equal or lower susceptibility to membership-inference and canary-extraction attacks.
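
For readers unfamiliar with the baseline mechanism, the clip-and-perturb step of standard DP-SGD can be sketched as follows. This is a minimal illustration of plain DP-SGD with fixed, global settings, not of RLDP itself; the function names and the parameters `clip_norm` and `noise_multiplier` are hypothetical names chosen for this sketch, and they stand in for the hard-coded control knobs mentioned above.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Rescale one example's gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / (norm + 1e-12))
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip per example, sum, add Gaussian noise, average."""
    batch_size = len(per_example_grads)
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    # Noise standard deviation is tied to the clipping threshold (assumed form).
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


# Toy usage: two examples, two parameters, noise disabled for readability.
rng = random.Random(0)
params = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4]]
new_params = dp_sgd_step(
    params, grads, lr=0.1, clip_norm=1.0, noise_multiplier=0.0, rng=rng
)
```

In this sketch both knobs are single scalars fixed for the whole run, which is the setting the abstract contrasts with per-parameter thresholds and noise magnitudes chosen adaptively during training.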