RbRL2.0: Integrated Reward and Policy Learning for Rating-based Reinforcement Learning

Abstract

Reinforcement learning (RL), a common tool in decision making, learns policies from various experiences based on the associated cumulative return/rewards without treating them differently. In contrast, humans often learn to distinguish among different levels of performance and extract the underlying trends to improve their decision making for best performance. Motivated by this, this paper proposes a novel RL method that mimics humans' decision making process by differentiating among collected experiences for effective policy learning. The main idea is to extract important directional information from experiences with different performance levels, named ratings, so that policies can be updated towards desired deviation from these experiences with different ratings. Specifically, we propose a new policy loss function that penalizes distribution similarities between the current policy and failed experiences with different ratings, and assigns different weights to the penalty terms based on the rating classes. Meanwhile, reward learning from these rated samples can be integrated with the new policy loss towards an integrated reward and policy learning from rated samples. Optimizing the integrated reward and policy loss function will lead to the discovery of directions for policy improvement towards maximizing cumulative rewards, penalizing similarity to the lowest performance level most and to the highest performance level least. To evaluate the effectiveness of the proposed method, we present results for experiments on a few typical environments that show improved convergence and overall performance over the existing rating-based reinforcement learning method with only reward learning.
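
The rating-weighted penalty described above can be illustrated with a small sketch. This is not the paper's loss function: the abstract does not specify the similarity measure or the weights, so the sketch assumes discrete action distributions, a hypothetical similarity exp(-KL), and hand-picked weights that decrease as the rating increases. The names `similarity`, `rating_penalty`, `rated_action_dists`, and `weights` are all illustrative.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def similarity(p, q):
    """Assumed similarity in (0, 1]: exp(-KL(p || q)); equals 1 when p == q."""
    return math.exp(-kl_divergence(p, q))


def rating_penalty(policy_probs, rated_action_dists, weights):
    """Weighted sum of similarities between the policy and each rating class.

    rated_action_dists: maps a rating class to an (assumed) empirical action
        distribution of the samples carrying that rating.
    weights: maps a rating class to its penalty weight (assumed values).
    """
    return sum(
        weights[rating] * similarity(dist, policy_probs)
        for rating, dist in rated_action_dists.items()
    )


# Three rating classes: 0 is the lowest performance level, 2 the highest.
rated_action_dists = {
    0: [0.8, 0.1, 0.1],
    1: [0.1, 0.8, 0.1],
    2: [0.1, 0.1, 0.8],
}
# Assumed weights: the lowest rating is penalized most, the highest least.
weights = {0: 1.0, 1: 0.5, 2: 0.1}

policy_like_worst = [0.8, 0.1, 0.1]
policy_like_best = [0.1, 0.1, 0.8]

penalty_worst = rating_penalty(policy_like_worst, rated_action_dists, weights)
penalty_best = rating_penalty(policy_like_best, rated_action_dists, weights)
print(penalty_worst > penalty_best)
```

Under these assumptions, a policy that resembles the lowest-rated samples incurs a larger penalty than one that resembles the highest-rated samples, which is the direction of deviation the abstract describes. In an integrated objective, a term of this kind would be added to the usual policy loss driven by the learned reward.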
