Safe Deployment of Offline Reinforcement Learning via Input Convex Action Correction

Abstract

Offline reinforcement learning (offline RL) offers a promising framework for developing control strategies in chemical process systems using historical data, without the risks or costs of online experimentation. This work investigates the application of offline RL to the safe and efficient control of an exothermic polymerisation continuous stirred-tank reactor. We introduce a Gymnasium-compatible simulation environment that captures the reactor's nonlinear dynamics, including reaction kinetics, energy balances, and operational constraints. The environment supports three industrially relevant scenarios: startup, grade change down, and grade change up. It also includes reproducible offline datasets generated from proportional-integral controllers with randomised tunings, providing a benchmark for evaluating offline RL algorithms in realistic process control tasks. We assess behaviour cloning and implicit Q-learning as baseline algorithms, highlighting the challenges offline agents face, including steady-state offsets and degraded performance near setpoints. To address these issues, we propose a novel deployment-time safety layer that performs gradient-based action correction using input convex neural networks (PICNNs) as learned cost models. The PICNN enables real-time, differentiable correction of policy actions by descending a convex, state-conditioned cost surface, without requiring retraining or environment interaction. Experimental results show that offline RL, particularly when combined with convex action correction, can outperform traditional control approaches and maintain stability across all scenarios. These findings demonstrate the feasibility of integrating offline RL with interpretable and safety-aware corrections for high-stakes chemical process control, and lay the groundwork for more reliable data-driven automation in industrial systems.
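The deployment-time correction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the trained PICNN is replaced here by a hand-written quadratic cost that is convex in the action for any fixed state, and the names `toy_cost_grad` and `correct_action`, the step size, the number of steps, and the action bounds are all assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence


def toy_cost_grad(state: Sequence[float], action: Sequence[float]) -> List[float]:
    """Action-gradient of a hypothetical convex cost (stand-in for a trained PICNN).

    Assumed cost: c(s, a) = sum_i (a_i - 0.5 * s_i)^2, convex in a for fixed s.
    """
    return [2.0 * (a - 0.5 * s) for s, a in zip(state, action)]


def correct_action(
    state: Sequence[float],
    action: Sequence[float],
    cost_grad: Callable[[Sequence[float], Sequence[float]], List[float]],
    step_size: float = 0.1,   # assumed value, not from the paper
    n_steps: int = 20,        # assumed value, not from the paper
    low: float = -1.0,        # assumed action bounds
    high: float = 1.0,
) -> List[float]:
    """Refine a policy action by projected gradient descent on the cost.

    The policy itself is untouched: only the proposed action is moved
    downhill on the state-conditioned cost surface, then clipped to bounds.
    """
    a = list(action)
    for _ in range(n_steps):
        g = cost_grad(state, a)
        a = [min(high, max(low, ai - step_size * gi)) for ai, gi in zip(a, g)]
    return a


# Usage: a policy proposes an action, the safety layer corrects it.
state = [1.0, -0.4]
proposed = [0.9, 0.9]
corrected = correct_action(state, proposed, toy_cost_grad)
```

Because the cost is convex in the action, gradient descent with a small enough step moves the action toward the unique minimiser within the bounds, which is what makes a correction of this kind usable without retraining or further environment interaction.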
