Abstract
We know from reinforcement learning theory that temporal difference learning can fail in certain cases. Sutton and Barto (2018) identify a deadly triad of function approximation, bootstrapping, and off-policy learning. When these three properties are combined, learning can diverge, with the value estimates becoming unbounded. However, several algorithms successfully combine these three properties, which indicates that there is at least a partial gap in our understanding. In this work, we investigate the impact of the deadly triad in practice, in the context of a family of popular deep reinforcement learning models (deep Q-networks trained with experience replay), analysing how the components of this system play a role in the emergence of the deadly triad, and in the agent's performance.