Abstract
Safe Reinforcement Learning (RL) aims to find a policy that achieves high rewards while satisfying cost constraints. When learning from scratch, safe RL agents tend to be overly conservative, which impedes exploration and restrains overall performance. In many realistic tasks, e.g., autonomous driving, large-scale expert demonstration data are available. We argue that extracting an expert policy from offline data to guide online exploration is a promising solution to mitigate the conservativeness issue. Large-capacity models, e.g., decision transformers (DT), have proven to be competent in offline policy learning. However, data collected in real-world scenarios rarely contain dangerous cases (e.g., collisions), which makes it difficult for the policies to learn safety concepts. Moreover, these bulky policy networks cannot meet the computation speed requirements at inference time in real-world tasks such as autonomous driving. To this end, we propose Guided Online Distillation (GOLD), an offline-to-online safe RL framework. GOLD distills an offline DT policy into a lightweight policy network through guided online safe RL training, yielding a policy that outperforms both the offline DT policy and online safe RL algorithms. Experiments on both benchmark safe RL tasks and real-world driving tasks based on the Waymo Open Motion Dataset (WOMD) demonstrate that GOLD can successfully distill lightweight policies and solve decision-making problems in challenging safety-critical scenarios.