Abstract
We investigate the discounting mismatch in actor-critic algorithm implementations from a representation learning perspective. Theoretically, actor-critic algorithms usually have discounting for both the actor and the critic, i.e., there is a $\gamma^t$ term in the actor update for the transition observed at time $t$ in a trajectory, and the critic is a discounted value function. Practitioners, however, usually ignore the discounting ($\gamma^t$) for the actor while using a discounted critic. We investigate this mismatch in two scenarios. In the first scenario, we consider optimizing an undiscounted objective ($\gamma = 1$), where $\gamma^t$ disappears naturally ($1^t = 1$). We then propose to interpret the discounting in the critic in terms of a bias-variance-representation trade-off and provide supporting empirical results. In the second scenario, we consider optimizing a discounted objective ($\gamma < 1$) and propose to interpret the omission of the discounting in the actor update from an auxiliary task perspective, again providing supporting empirical results.
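
To make the mismatch concrete, the two actor updates can be sketched as follows. The notation is assumed for illustration and is not fixed by the abstract: $\theta$ denotes the policy parameters, $\alpha$ a step size, $\pi_\theta$ the policy, and $\hat{q}_\gamma$ the discounted critic's estimate of the action value. For the transition $(S_t, A_t)$ observed at time $t$, the theoretically motivated update and the update commonly used in practice would then read, respectively:

$$
\begin{aligned}
\text{theory:} \quad & \theta \leftarrow \theta + \alpha \, \gamma^t \, \hat{q}_\gamma(S_t, A_t) \, \nabla_\theta \log \pi_\theta(A_t \mid S_t), \\
\text{practice:} \quad & \theta \leftarrow \theta + \alpha \, \hat{q}_\gamma(S_t, A_t) \, \nabla_\theta \log \pi_\theta(A_t \mid S_t).
\end{aligned}
$$

The two updates differ only in the factor $\gamma^t$, so they coincide when $\gamma = 1$. Implementations often replace $\hat{q}_\gamma$ with an advantage estimate, which does not change where the $\gamma^t$ factor appears.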