Learning with Opponent-Learning Awareness

Abstract

Multi-agent settings are quickly gathering importance in machine learning. Beyond a plethora of recent work on deep multi-agent reinforcement learning, hierarchical reinforcement learning, generative adversarial networks and decentralized optimization can all be seen as instances of this setting. However, the presence of multiple learning agents in these settings renders the training problem non-stationary and often leads to unstable training or undesired final results. We present Learning with Opponent-Learning Awareness (LOLA), a method that reasons about the anticipated learning of the other agents. The LOLA learning rule includes an additional term that accounts for the impact of the agent's policy on the anticipated parameter update of the other agents. We show that the LOLA update rule can be efficiently calculated using an extension of the likelihood ratio policy gradient update, making the method suitable for model-free RL. This method thus scales to large parameter and input spaces and nonlinear function approximators. Preliminary results show that the encounter of two LOLA agents leads to the emergence of tit-for-tat and therefore cooperation in the iterated prisoners' dilemma (IPD), while independent learning does not. In this domain, LOLA also receives higher payouts compared to a naive learner, and is robust against exploitation by higher order gradient-based methods. Applied to infinitely repeated matching pennies, LOLA agents converge to the Nash equilibrium. In a round robin tournament we show that LOLA agents can successfully shape the learning of a range of multi-agent learning algorithms from literature, resulting in the highest average returns on the IPD. We also apply LOLA to a grid world task with an embedded social dilemma using deep recurrent policies. Again, by considering the learning of the other agent, LOLA agents learn to cooperate out of selfish interests.
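
The additional term in the learning rule can be illustrated with a minimal sketch. It assumes a toy one-shot bilinear zero-sum game with exact gradients, not the repeated games or the policy gradient estimator described in the abstract. The game, the function names, and the step sizes `delta` (the agent's own step) and `eta` (the step it assumes its opponent takes) are all illustrative assumptions. The sketch keeps only the correction that arises from differentiating through the opponent's anticipated update.

```python
# Toy game (an assumption for illustration, not taken from the paper):
#   V1(x, y) = x * y,   V2(x, y) = -x * y,   Nash equilibrium at (0, 0).
# Agent 1 controls x and maximizes V1; agent 2 controls y and maximizes V2.


def naive_step(x, y, delta):
    """Each agent ascends its own value, treating the other agent as fixed."""
    dV1_dx = y
    dV2_dy = -x
    return x + delta * dV1_dx, y + delta * dV2_dy


def lola_step(x, y, delta, eta):
    """Naive gradient plus a term for the opponent's anticipated update.

    Agent 1 assumes agent 2 will move by eta * dV2/dy and differentiates
    that anticipated move with respect to its own parameter x (and vice versa).
    """
    dV1_dx, dV1_dy = y, x
    dV2_dx, dV2_dy = -y, -x
    d2V2_dxdy = -1.0  # how agent 2's gradient step changes with x
    d2V1_dxdy = 1.0   # how agent 1's gradient step changes with y
    new_x = x + delta * (dV1_dx + eta * dV1_dy * d2V2_dxdy)
    new_y = y + delta * (dV2_dy + eta * dV2_dx * d2V1_dxdy)
    return new_x, new_y


def run(step, n_steps, **kwargs):
    """Iterate a step function from (1, 1); return the distance to (0, 0)."""
    x, y = 1.0, 1.0
    for _ in range(n_steps):
        x, y = step(x, y, **kwargs)
    return (x * x + y * y) ** 0.5


naive_dist = run(naive_step, 200, delta=0.1)
lola_dist = run(lola_step, 200, delta=0.1, eta=1.0)
```

In this toy setting the naive learners spiral away from the equilibrium while the opponent-aware learners spiral into it. This only illustrates the mechanism and is not a reproduction of the paper's experiments.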
