FOSP: Fine-tuning Offline Safe Policy through World Models

Abstract

Offline Safe Reinforcement Learning (RL) seeks to satisfy safety constraints by learning from static datasets, thereby restricting exploration. However, these approaches rely heavily on the dataset and struggle to generalize safely to unseen scenarios. In this paper, we aim to improve safety during the deployment of vision-based robotic tasks by fine-tuning an offline-pretrained policy online. To make fine-tuning efficient, we adopt model-based RL, which is known for its data efficiency. Specifically, our method employs in-sample optimization to improve offline training efficiency while incorporating reachability guidance to ensure safety. After obtaining an offline safe policy, we apply a safe policy expansion approach for online fine-tuning. We validate our method on simulation benchmarks with five vision-only tasks and through real-world robot deployment with limited data. The results demonstrate that our approach significantly improves the generalization of offline policies to unseen safety-constrained scenarios. To the best of our knowledge, this is the first work to explore offline-to-online RL for safe generalization tasks.
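The abstract does not say which in-sample objective FOSP uses. A common way to learn values without querying actions outside the dataset is expectile regression, as in Implicit Q-Learning (IQL). The sketch below assumes that choice purely for illustration: the `targets` stand in for Q(s, a) evaluated only on dataset pairs, and `expectile_loss` and `fit_expectile` are hypothetical helper names, not the paper's code. With tau = 0.5 the fitted value is the mean; as tau approaches 1 it approaches the maximum over in-dataset targets, which approximates a max over actions without evaluating out-of-distribution ones.

```python
from typing import Sequence

# Illustrative assumption: IQL-style expectile regression as one example of
# in-sample optimization; this is not confirmed to be FOSP's exact objective.


def expectile_loss(diffs: Sequence[float], tau: float = 0.7) -> float:
    """Mean asymmetric squared loss |tau - 1(u < 0)| * u**2 over residuals u."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie in (0, 1)")
    total = 0.0
    for u in diffs:
        # Positive residuals (target above the estimate) get weight tau,
        # negative ones get 1 - tau, so tau > 0.5 pushes the estimate upward.
        weight = tau if u >= 0 else 1.0 - tau
        total += weight * u * u
    return total / len(diffs)


def fit_expectile(targets: Sequence[float], tau: float = 0.7,
                  lr: float = 0.1, steps: int = 2000) -> float:
    """Fit a scalar v to in-dataset targets by gradient descent on the expectile loss."""
    v = sum(targets) / len(targets)  # start from the mean (the tau = 0.5 solution)
    for _ in range(steps):
        grad = 0.0
        for x in targets:
            u = x - v
            weight = tau if u >= 0 else 1.0 - tau
            # d/dv of weight * (x - v)**2 is -2 * weight * (x - v)
            grad += -2.0 * weight * u
        v -= lr * grad / len(targets)
    return v
```

For example, `fit_expectile([0.0, 1.0, 2.0, 3.0], tau=0.5)` returns 1.5 (the mean), while `tau=0.9` converges to about 2.5, closer to the largest in-dataset target of 3.0.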
