Abstract
The performance of a reinforcement learning algorithm can vary drastically during learning because of exploration. Existing algorithms provide little information about their current policy's quality before executing it, and thus have limited use in high-stakes applications like healthcare. In this paper, we address such a lack of accountability by proposing that algorithms output policy certificates, which upper bound the suboptimality in the next episode, allowing humans to intervene when the certified quality is not satisfactory. We further present a new learning framework (IPOC) for finite-sample analysis with policy certificates, and develop two IPOC algorithms that enjoy guarantees for the quality of both their policies and certificates.