Abstract
As a data-driven paradigm, offline reinforcement learning (Offline RL) has been formulated as sequence modeling, where the Decision Transformer (DT) has demonstrated exceptional capabilities. Unlike previous reinforcement learning methods that fit value functions or compute policy gradients, DT conditions an autoregressive model on the expected returns, past states, and actions, using a causally masked Transformer to output the optimal action. However, due to the inconsistency between the sampled returns within a single trajectory and the optimal returns across multiple trajectories, it is challenging to set an expected return that yields the optimal action and stitches together suboptimal trajectories. Decision ConvFormer (DC) is easier to understand than DT in the context of modeling RL trajectories within a Markov Decision Process. We propose the Q-value Regularized Decision ConvFormer (QDC), which combines DC's understanding of RL trajectories with a term that maximizes action values using dynamic programming methods during training. This ensures that the expected returns of the sampled actions are consistent with the optimal returns. QDC achieves excellent performance on the D4RL benchmark, outperforming or approaching the optimal level in all tested environments. It is particularly competitive in trajectory stitching capability.