Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization

  • 2024-11-27 18:57:03
  • Cheng Tang, Zhishuai Liu, Pan Xu
The Distributionally Robust Markov Decision Process (DRMDP) is a popular framework for addressing dynamics shift in reinforcement learning by learning policies robust to the worst-case transition dynamics within a constrained set. However, solving its dual optimization oracle poses significant challenges, limiting theoretical analysis and computational efficiency. The recently proposed Robust Regularized Markov Decision Process (RRMDP) replaces the uncertainty set constraint with a regularization term on the value function, offering improved scalability and theoretical insights. Yet, existing RRMDP methods rely on unstructured regularization, often leading to overly conservative policies by considering transitions that are unrealistic. To address these issues, we propose a novel framework, the $d$-rectangular linear robust regularized Markov decision process ($d$-RRMDP), which introduces a linear latent structure into both transition kernels and regularization. For the offline RL setting, where an agent learns robust policies from a pre-collected dataset in the nominal environment, we develop a family of algorithms, Robust Regularized Pessimistic Value Iteration (R2PVI), employing linear function approximation and $f$-divergence based regularization terms on transition kernels. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, showing these bounds depend on how well the dataset covers state-action spaces visited by the optimal robust policy under robustly admissible transitions. This term is further shown to be fundamental to $d$-RRMDPs via information-theoretic lower bounds. Finally, numerical experiments validate that R2PVI learns robust policies and is computationally more efficient than methods for constrained DRMDPs.
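To illustrate why replacing the uncertainty set constraint with a regularization term can simplify computation, the sketch below uses the standard Kullback-Leibler duality for a single next-state distribution: $\inf_Q \{ E_Q[V] + \lambda \, \mathrm{KL}(Q \| P^0) \} = -\lambda \log E_{P^0}[\exp(-V/\lambda)]$. This is a generic textbook identity on a finite state space, not the R2PVI algorithm or the $d$-rectangular formulation itself. The function names, the toy nominal distribution `p0`, the value vector `v`, and the regularization weight `lam` are all assumptions made for illustration.

```python
import numpy as np


def kl_regularized_value(p0, v, lam):
    """Closed form of inf_Q { E_Q[v] + lam * KL(Q || p0) } on a finite support.

    Equals -lam * log E_{p0}[exp(-v / lam)]; computed with a max-shift
    for numerical stability.
    """
    p0 = np.asarray(p0, dtype=float)
    v = np.asarray(v, dtype=float)
    z = -v / lam
    m = z.max()
    return float(-lam * (m + np.log(np.sum(p0 * np.exp(z - m)))))


def worst_case_distribution(p0, v, lam):
    """Minimizer of the regularized objective: Q proportional to p0 * exp(-v / lam)."""
    p0 = np.asarray(p0, dtype=float)
    v = np.asarray(v, dtype=float)
    z = -v / lam
    w = p0 * np.exp(z - z.max())
    return w / w.sum()


def regularized_objective(q, p0, v, lam):
    """Evaluate E_q[v] + lam * KL(q || p0) directly for a candidate q."""
    q = np.asarray(q, dtype=float)
    p0 = np.asarray(p0, dtype=float)
    v = np.asarray(v, dtype=float)
    mask = q > 0  # terms with q = 0 contribute nothing to the KL sum
    kl = np.sum(q[mask] * np.log(q[mask] / p0[mask]))
    return float(q @ v + lam * kl)


# Toy example (hypothetical numbers): three next states.
p0 = np.array([0.5, 0.3, 0.2])   # nominal next-state distribution
v = np.array([1.0, 0.0, 2.0])    # next-state values
lam = 0.5                        # regularization weight

closed_form = kl_regularized_value(p0, v, lam)
q_star = worst_case_distribution(p0, v, lam)
direct = regularized_objective(q_star, p0, v, lam)
```

In this sketch no inner optimization over an uncertainty set is needed: the regularized worst case reduces to a log-mean-exp of the values under the nominal distribution. As `lam` grows, the result approaches the nominal expectation `p0 @ v`; as it shrinks, the result moves toward the smallest value in the support of `p0`.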


