Dynamic Policy Fusion for User Alignment Without Re-Interaction

Abstract

Deep reinforcement learning (RL) policies, although optimal in terms of task rewards, may not align with the personal preferences of human users. To ensure this alignment, a naive solution would be to retrain the agent using a reward function that encodes the user's specific preferences. However, such a reward function is typically not readily available, and retraining the agent from scratch can be prohibitively expensive. We propose a more practical approach: adapting the already trained policy to user-specific needs with the help of human feedback. To this end, we infer the user's intent through trajectory-level feedback and combine it with the trained task policy via a theoretically grounded dynamic policy fusion approach. As our approach collects human feedback on the very same trajectories used to learn the task policy, it does not require any additional interactions with the environment, making it a zero-shot approach. We empirically demonstrate in a number of environments that our proposed dynamic policy fusion approach consistently achieves the intended task while simultaneously adhering to user-specific needs.
