Abstract
Offline reinforcement learning (RL) enables training from fixed data without online interaction, but policies learned offline often struggle when deployed in dynamic environments due to distributional shift and unreliable value estimates on unseen state-action pairs. We introduce Behavior-Adaptive Q-Learning (BAQ), a framework designed to enable a smooth and reliable transition from offline to online RL. The key idea is to leverage an implicit behavioral model derived from offline data to provide a behavior-consistency signal during online fine-tuning. BAQ incorporates a dual-objective loss that (i) aligns the online policy toward the offline behavior when uncertainty is high, and (ii) gradually relaxes this constraint as more confident online experience is accumulated. This adaptive mechanism reduces error propagation from out-of-distribution estimates, stabilizes early online updates, and accelerates adaptation to new scenarios. Across standard benchmarks, BAQ consistently outperforms prior offline-to-online RL approaches, achieving faster recovery, improved robustness, and higher overall performance. Our results demonstrate that implicit behavior adaptation is a principled and practical solution for reliable real-world policy deployment.