Abstract
Reinforcement learning is essential for neural architecture search and hyperparameter optimization, but the prohibitive time and computational costs of conventional approaches impede widespread use. Inspired by the DeepSeek-V3 multi-token prediction architecture, we propose Sequential Policy Gradient modeling (SPG), a novel trajectory generation paradigm for lightweight online hyperparameter optimization. In contrast to conventional policy gradient methods, SPG extends the base model with temporary modules, enabling it to generate state-action (padded) trajectories in a single forward pass. Our experiments demonstrate that models gain performance when retrained with SPG on their original datasets and also outperform standard transfer fine-tuning. We evaluate on five datasets spanning computer vision (ImageNet, COCO), natural language processing (GLUE, SQuAD), and audio (SUPERB) to assess the industrial applicability of SPG. The proposed method demonstrates consistent improvements across widely adopted models, achieving performance gains of $+0.2\sim7\%$ at low computational cost. Fully reproducible code and pre-trained models: https://huggingface.co/UniversalAlgorithmic/SPG.