Abstract
With Deep Reinforcement Learning (DRL) being increasingly considered for the control of real-world systems, the lack of transparency of the neural network at the core of RL becomes a concern. Programmatic Reinforcement Learning (PRL) is able to create representations of this black box in the form of source code, not only increasing the explainability of the controller but also allowing for user adaptations. However, these methods focus on distilling a black-box policy into a program, and do so after learning, using the Mean Squared Error between produced and desired behaviour and discarding other elements of the RL algorithm. The distilled policy may therefore perform significantly worse than the learned black-box policy. In this paper, we propose to directly learn a program as the policy of an RL agent. We build on TD3 and use its critics as the basis of the objective function of a genetic algorithm that synthesises the program. Our approach builds the program during training, as opposed to after the fact. This steers the program toward high actual rewards rather than merely toward a low Mean Squared Error. In addition, our approach leverages the TD3 critics to achieve high sample-efficiency, as opposed to pure genetic methods that rely on Monte-Carlo evaluations. Our experiments demonstrate the validity, explainability and sample-efficiency of our approach in a simple gridworld environment.
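The central idea, scoring candidate programs with a critic instead of with Monte-Carlo rollouts, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the implementation described in this paper: `critic` is a hand-written stand-in for a learned TD3 critic, a "program" is reduced to the two coefficients of the expression `a * state + b`, and the genetic algorithm is a plain elitist mutate-and-select loop.

```python
import random


def critic(state, action):
    # Hypothetical stand-in for a learned critic Q(s, a).
    # Assumption for this demo: the best action for a state is -state.
    return -(action + state) ** 2


def run_program(program, state):
    # A "program" is simplified here to the expression a * state + b.
    a, b = program
    return a * state + b


def fitness(program, states):
    # Objective of the genetic search: mean critic value of the actions
    # the program produces on a batch of states (no environment rollouts).
    return sum(critic(s, run_program(program, s)) for s in states) / len(states)


def mutate(program, rng, scale=0.3):
    # Gaussian perturbation of every coefficient.
    return tuple(p + rng.gauss(0.0, scale) for p in program)


def evolve(states, generations=50, pop_size=20, elite=5, seed=0):
    rng = random.Random(seed)
    population = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the best candidates unchanged (elitism) and refill the
        # population with mutated copies of them.
        population.sort(key=lambda p: fitness(p, states), reverse=True)
        parents = population[:elite]
        children = [mutate(rng.choice(parents), rng) for _ in range(pop_size - elite)]
        population = parents + children
    return max(population, key=lambda p: fitness(p, states))


# Batch of states standing in for samples drawn from a replay buffer.
states = [i / 10 for i in range(-10, 11)]
best = evolve(states)
print("best program coefficients:", best)
print("critic-based fitness:", fitness(best, states))
```

Because the fitness function only queries the critic on stored states, evaluating a candidate costs no additional environment interaction, which is the source of the sample-efficiency argument made above. In an actual agent the critic would be trained alongside the search, which this sketch omits.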