Abstract
Reinforcement learning (RL) algorithms typically deal with maximizing the expected cumulative return (discounted or undiscounted, finite or infinite horizon). However, several crucial applications in the real world, such as drug discovery, do not fit within this framework because an RL agent only needs to identify states (molecules) that achieve the highest reward within a trajectory and does not need to optimize for the expected cumulative return. In this work, we formulate an objective function to maximize the expected maximum reward along a trajectory, derive a novel functional form of the Bellman equation, introduce the corresponding Bellman operators, and provide a proof of convergence. Using this formulation, we achieve state-of-the-art results on the task of molecule generation that mimics a real-world drug discovery pipeline.
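
The difference between the two objectives can be illustrated on a pair of toy trajectories. The following is a minimal sketch, not the implementation used in this work: the function names `discounted_return` and `max_reward` and the reward values are assumptions chosen purely for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 1.0) -> float:
    """Cumulative (optionally discounted) return of a single trajectory."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def max_reward(rewards: Sequence[float]) -> float:
    """Highest reward reached at any step of a single trajectory."""
    return max(rewards)


# Hypothetical per-step rewards (e.g. a score for each intermediate molecule).
traj_a = [0.25, 0.25, 0.25, 0.25]  # steady but mediocre states
traj_b = [0.0, 0.0, 0.75, 0.0]     # one high-reward state

# The cumulative objective ranks traj_a above traj_b,
# whereas the max-reward objective ranks traj_b above traj_a.
prefers_a_by_return = discounted_return(traj_a) > discounted_return(traj_b)
prefers_b_by_max = max_reward(traj_b) > max_reward(traj_a)
```

Under the cumulative objective the first trajectory looks better, even though only the second one ever visits a high-reward state. Averaging `max_reward` over sampled trajectories gives a Monte Carlo estimate of the expected maximum reward described above.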