Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning

Abstract

We present a study on reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine translation (NMT). We investigate the reliability of human bandit feedback, and analyze the influence of reliability on the learnability of a reward estimator, and the effect of the quality of reward estimates on the overall RL task. Our analysis of cardinal (5-point ratings) and ordinal (pairwise preferences) feedback shows that their intra- and inter-annotator $\alpha$-agreement is comparable. Best reliability is obtained for standardized cardinal feedback, and cardinal feedback is also easiest to learn and generalize from. Finally, improvements of over 1 BLEU can be obtained by integrating a regression-based reward estimator trained on cardinal feedback for 800 translations into RL for NMT. This shows that RL is possible even from small amounts of fairly reliable human feedback, pointing to a great potential for applications at larger scale.
