Learning Natural Language Constraints for Safe Reinforcement Learning of Language Agents

Abstract

Generalizable alignment is a core challenge for deploying Large Language Models (LLMs) safely in real-world NLP applications. Current alignment methods, including Reinforcement Learning from Human Feedback (RLHF), often fail to guarantee constraint satisfaction outside their training distribution due to their reliance on implicit, post-hoc preferences. Inspired by a paradigm shift to first curate data before tuning, we introduce a new framework for safe language alignment that learns natural language constraints from positive and negative demonstrations as a primary step. By inferring both a task-specific reward function and latent constraint functions, our approach fosters adaptation to novel safety requirements and robust generalization under domain shifts and adversarial inputs. We formalize the framework within a Constrained Markov Decision Process (CMDP) and validate it via a text-based navigation environment, demonstrating safe adaptation to changing danger zones. Our experiments show fewer violations upon domain shift when following a safe navigation path, and we achieve zero violations by applying learned constraints to a distilled BERT model as a fine-tuning technique. This work offers a promising path toward building safety-critical and more generalizable LLMs for practical NLP settings.
