Abstract
Deep reinforcement learning (DRL) has proven extremely useful in a large variety of application domains. However, even successful DRL-based software can exhibit highly undesirable behavior. This is because DRL training is based on maximizing a reward function, which typically captures general trends but cannot precisely capture, or rule out, certain behaviors of the system. In this paper, we propose a novel framework aimed at drastically reducing the undesirable behavior of DRL-based software while maintaining its excellent performance. In addition, our framework can help provide engineers with a comprehensible characterization of such undesirable behavior. Under the hood, our approach is based on extracting decision tree classifiers from erroneous state-action pairs and then integrating these trees into the DRL training loop, penalizing the system whenever it makes an error. We provide a proof-of-concept implementation of our approach and use it to evaluate the technique on three significant case studies. We find that our approach can extend existing frameworks in a straightforward manner and incurs only a slight overhead in training time. Further, it incurs only a very slight hit to performance, or in some cases even improves it, while significantly reducing the frequency of undesirable behavior.
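To make the idea of penalizing errors via extracted decision trees concrete, the following is a minimal pure-Python sketch, not the implementation evaluated in the paper. It fits a small decision tree on labeled state-action pairs and subtracts a penalty from the reward whenever the tree classifies a pair as erroneous. The function names (`fit_tree`, `predict`, `shaped_reward`), the Gini-based splitting rule, the penalty value, and the toy data (a one-dimensional "distance" state with actions 0 = brake, 1 = accelerate) are all assumptions introduced for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def fit_tree(X, y, depth=0, max_depth=3):
    """Fit a small binary decision tree on a non-empty dataset.
    Each row of X is (state features..., action); y is 1 for an
    erroneous state-action pair and 0 otherwise (hypothetical labeling)."""
    if depth == max_depth or len(set(y)) <= 1:
        return Counter(y).most_common(1)[0][0]  # leaf: majority label
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue  # split does not separate anything
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return Counter(y).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t,
            fit_tree([X[i] for i in left], [y[i] for i in left], depth + 1, max_depth),
            fit_tree([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth))

def predict(tree, x):
    """Walk the tree: internal nodes are (feature, threshold, low, high)."""
    while isinstance(tree, tuple):
        f, t, low, high = tree
        tree = low if x[f] <= t else high
    return tree

def shaped_reward(reward, state, action, tree, penalty=1.0):
    """Subtract an assumed fixed penalty when the tree flags the pair as an error."""
    return reward - penalty if predict(tree, (*state, action)) == 1 else reward

# Toy data (made up): accelerating (action 1) at distance below 2 is an error.
X = [(0.5, 0), (0.5, 1), (1.5, 0), (1.5, 1), (3.0, 0), (3.0, 1), (5.0, 0), (5.0, 1)]
y = [0, 1, 0, 1, 0, 0, 0, 0]
tree = fit_tree(X, y)

near_obstacle = shaped_reward(1.0, (1.0,), 1, tree, penalty=5.0)  # flagged, penalized
far_away = shaped_reward(1.0, (4.0,), 1, tree, penalty=5.0)       # not flagged
```

In a training loop, `shaped_reward` would be applied to the environment's reward at every step before it is passed to the learning algorithm. Because the fitted tree is a nested tuple of feature thresholds, it can also be read directly as a set of rules describing when errors occur, which is the kind of comprehensible characterization the abstract refers to.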