Abstract
Recent successes of value-based multi-agent deep reinforcement learning employ optimism by limiting underestimation updates of the value function estimator, through a carefully controlled learning rate (Omidshafiei et al., 2017) or a reduced update probability (Palmer et al., 2018). To achieve full cooperation when learning independently, an agent must estimate state values contingent on having optimal teammates; therefore, value overestimation is frequently injected to counteract the negative effects of unobservable, sub-optimal teammate policies and exploration. Aiming to solve this issue through automatic scheduling, this paper introduces a decentralized quantile estimator, which we found empirically to be more stable, more sample efficient, and more likely to converge to the joint optimal policy.
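To make the notion of optimism through a controlled learning rate concrete, the following is a minimal tabular sketch, not the method proposed in this paper. It assumes a single scalar value estimate and two hand-picked step sizes; the names `hysteretic_update`, `run`, `alpha`, and `beta` are illustrative and do not come from the cited works. Updates that would lower the estimate use the smaller step size, so low returns caused by a teammate's exploration are discounted.

```python
def hysteretic_update(q, target, alpha=0.1, beta=0.01):
    """One value update with asymmetric learning rates (illustrative only).

    alpha is used when the target is at or above the current estimate,
    beta < alpha when it is below, which damps underestimation updates.
    """
    delta = target - q
    rate = alpha if delta >= 0 else beta
    return q + rate * delta


def run(returns, alpha=0.1, beta=0.01, q0=0.0):
    """Apply the update to a sequence of observed returns, starting from q0."""
    q = q0
    for r in returns:
        q = hysteretic_update(q, r, alpha, beta)
    return q


# Hypothetical scenario: the same action pays 10 when the teammate
# cooperates and 0 when the teammate explores, in alternation.
returns = [10.0, 0.0] * 200
optimistic = run(returns)             # asymmetric rates
average = run(returns, beta=0.1)      # symmetric rates, ordinary averaging
print(optimistic > average)
```

Under these assumptions the asymmetric estimate settles near the payoff obtained with a cooperating teammate, whereas the symmetric estimate settles near the mean of the two outcomes. The fixed ratio between `alpha` and `beta` is exactly the kind of hand-tuned schedule that the abstract aims to replace with automatic scheduling.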