Abstract
The effectiveness of Large Language Models (LLMs) in solving tasks depends heavily on the quality of their instructions, which often require fine-tuning through extensive human effort. This highlights the need for automated instruction optimization; however, such optimization is particularly challenging for black-box LLMs, whose parameters and gradients are inaccessible. We propose ACING, a task-specific prompt optimization approach framed as a stateless continuous-action Reinforcement Learning (RL) problem, known as the continuum bandit setting. ACING leverages an actor-critic-based method to optimize prompts, learning from non-differentiable reward signals. We validate ACING by optimizing prompts for ChatGPT on 30 instruction-based tasks. ACING consistently outperforms baseline methods, achieving a median score improvement of 10 percentage points. Furthermore, ACING not only recovers but also surpasses human-crafted expert instructions, achieving up to a 39 percentage point improvement against human benchmarks.
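To make the continuum bandit framing concrete, the following is a minimal sketch of a stateless actor-critic loop on a toy problem. It is not the ACING implementation. The reward function `black_box_reward`, its `target` vector, the function `run_bandit_actor_critic`, and all hyperparameters are hypothetical stand-ins chosen for illustration. In the setting described above, the action would presumably be a continuous vector that determines a prompt, and the reward would be the task score returned by the black-box LLM. Here the actor is the mean of a Gaussian policy and the critic is a single scalar estimate of expected reward, which is the simplest form a critic can take when there is no state.

```python
import numpy as np


def black_box_reward(action):
    """Hypothetical stand-in for scoring a prompt with a black-box LLM.

    Only the scalar reward is observed; no gradient is exposed.
    The value lies in (0, 1] and peaks at action == target.
    """
    target = np.array([0.5, -0.3])  # assumed optimum, unknown to the learner
    return float(np.exp(-np.sum((action - target) ** 2)))


def run_bandit_actor_critic(reward_fn, dim=2, steps=3000, sigma=0.3,
                            actor_lr=0.05, critic_lr=0.1, seed=0):
    """Stateless actor-critic: Gaussian policy mean plus scalar value baseline."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)  # actor: mean of the Gaussian policy over actions
    value = 0.0         # critic: running estimate of the expected reward

    for _ in range(steps):
        # Sample one continuous action; there is no state to condition on.
        action = mu + sigma * rng.standard_normal(dim)
        reward = reward_fn(action)

        # The critic's estimate serves as a baseline for the actor update.
        advantage = reward - value

        # Score-function gradient: d/dmu log N(a; mu, sigma^2 I) = (a - mu) / sigma^2
        mu += actor_lr * advantage * (action - mu) / sigma ** 2

        # Move the critic toward the observed reward.
        value += critic_lr * advantage

    return mu, value


mu, value = run_bandit_actor_critic(black_box_reward)
print("learned action:", mu)
print("reward at learned action:", black_box_reward(mu))
```

The update uses only sampled rewards, so it applies even when the reward is non-differentiable with respect to the action. A richer critic, for example a learned function of the action, would be a natural extension, but that design choice is an assumption rather than something stated in the abstract.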