User-Interactive Offline Reinforcement Learning

Abstract

Offline reinforcement learning (RL) algorithms are still not trusted in practice, because the learned policy may perform worse than the original policy that generated the dataset, or may behave in unexpected ways that are unfamiliar to the user. At the same time, offline RL algorithms cannot tune their most important hyperparameter: the proximity of the learned policy to the original policy. We propose an algorithm that lets the user tune this hyperparameter at runtime, thereby addressing both of the above-mentioned issues simultaneously. Users can start with the original behavior and grant successively greater deviation, and they can stop at any time if the policy deteriorates or its behavior moves too far from the familiar one.
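The abstract does not say how the policy is parameterized or trained. The sketch below is therefore hypothetical and covers only the interaction loop it describes. The deviation setting starts at 0.0, which stands for the original behavior, and is then raised step by step. The loop stops at the first drop in performance, and `max_deviation` caps how far the behavior may move from the familiar one. The names `tune_proximity` and `evaluate` are illustrative assumptions. `evaluate` stands in for whatever performance estimate or user judgment is available for a given setting, and the user's decision to stop is replaced by a simple threshold rule.

```python
from typing import Callable, List, Tuple


def tune_proximity(
    evaluate: Callable[[float], float],
    step: float = 0.1,
    max_deviation: float = 1.0,
    tolerance: float = 0.0,
) -> Tuple[float, List[Tuple[float, float]]]:
    """Hypothetical user-in-the-loop schedule for a proximity hyperparameter.

    0.0 means "behave like the original policy"; larger values grant more
    deviation. `evaluate` is an assumed stand-in for the user's assessment.
    """
    accepted = 0.0                      # start from the original behavior
    accepted_score = evaluate(accepted)
    history = [(accepted, accepted_score)]
    n_steps = round(max_deviation / step)
    for i in range(1, n_steps + 1):
        candidate = round(i * step, 10)  # avoid float drift such as 0.30000000000000004
        score = evaluate(candidate)
        history.append((candidate, score))
        if score < accepted_score - tolerance:
            break                        # policy deteriorated: keep the last accepted setting
        accepted, accepted_score = candidate, score
    return accepted, history


if __name__ == "__main__":
    # Toy performance curve that peaks at a deviation of 0.3 (made up for illustration).
    best, trace = tune_proximity(lambda d: -(d - 0.3) ** 2)
    print(best)  # 0.3
```

With this toy curve, the loop accepts 0.1, 0.2 and 0.3. At 0.4 the score drops, so the loop stops and returns 0.3. In the setting the abstract describes, the user would make this stop-or-continue decision interactively, not a fixed rule.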
