Abstract
Offline reinforcement learning (RL) relies heavily on the coverage of pre-collected data over the target policy's distribution. Existing studies aim to improve data-policy coverage to mitigate distributional shifts, but overlook the security risks of insufficient coverage, and their single-step analysis is inconsistent with the multi-step decision-making nature of offline RL. To address this, we introduce the sequence-level concentrability coefficient to quantify coverage, and reveal, through theoretical analysis, its exponential amplification of the upper bound on estimation errors. Building on this, we propose the Collapsing Sequence-Level Data-Policy Coverage (CSDPC) poisoning attack. Considering the continuous nature of offline RL data, we convert state-action pairs into decision units and extract representative decision patterns that capture multi-step behavior. We identify rare patterns likely to cause insufficient coverage, and poison them to reduce coverage and exacerbate distributional shifts. Experiments show that poisoning just 1% of the dataset can degrade agent performance by 90%. This finding provides new perspectives for analyzing and safeguarding the security of offline RL.