Abstract
The performance of a reinforcement learning algorithm can vary drasticallyduring learning because of exploration. Existing algorithms provide littleinformation about the quality of their current policy before executing it, andthus have limited use in high-stakes applications, such as healthcare. Weaddress this lack of accountability by proposing that algorithms output policycertificates. These certificates bound the sub-optimality and return of thepolicy in the next episode, allowing humans to intervene when the certifiedquality is not satisfactory. We further introduce two new algorithms withcertificates and present a new framework for theoretical analysis thatguarantees the quality of their policies and certificates. For tabular MDPs, weshow that computing certificates can even improve the sample-efficiency ofoptimism-based exploration. As a result, one of our algorithms achieves regretand PAC bounds that are minimax up to lower-order terms.