DPO: A Differential and Pointwise Control Approach to Reinforcement Learning

Abstract

Reinforcement learning (RL) in continuous state-action spaces remains challenging in scientific computing due to poor sample efficiency and lack of pathwise physical consistency. We introduce Differential Reinforcement Learning (Differential RL), a novel framework that reformulates RL from a continuous-time control perspective via a differential dual formulation. This induces a Hamiltonian structure that embeds physics priors and ensures consistent trajectories without requiring explicit constraints. To implement Differential RL, we develop Differential Policy Optimization (DPO), a pointwise, stage-wise algorithm that refines local movement operators along the trajectory for improved sample efficiency and dynamic alignment. We establish pointwise convergence guarantees, a property not available in standard RL, and derive a competitive theoretical regret bound of $O(K^{5/6})$. Empirically, DPO outperforms standard RL baselines on representative scientific computing tasks, including surface modeling, grid control, and molecular dynamics, under low-data and physics-constrained conditions.
