Reducing the variance in online optimization by transporting past gradients

Abstract

Most stochastic optimization methods use gradients once before discarding them. While variance reduction methods have shown that reusing past gradients can be beneficial when there is a finite number of datapoints, they do not easily extend to the online setting. One issue is the staleness due to using past gradients. We propose to correct this staleness using the idea of implicit gradient transport (IGT), which transforms gradients computed at previous iterates into gradients evaluated at the current iterate without using the Hessian explicitly. In addition to reducing the variance and bias of our updates over time, IGT can be used as a drop-in replacement for the gradient estimate in a number of well-understood methods such as heavy ball or Adam. We show experimentally that it achieves state-of-the-art results on a wide range of architectures and benchmarks. Additionally, the IGT gradient estimator yields the optimal asymptotic convergence rate for online stochastic optimization in the restricted setting where the Hessians of all component functions are equal.
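
To make the transport idea concrete, here is a minimal sketch of one way such an estimator can be written. The abstract does not give the update rule, so the recursion below is an assumption: the running estimate v is averaged with weight gamma_t = t / (t + 1), and the fresh gradient is evaluated at the extrapolated point theta_t + t * (theta_t - theta_{t-1}). The function name, the toy quadratic objective, and all parameters are hypothetical. In the restricted setting the abstract mentions (every sample shares the same Hessian H), this recursion should reproduce the average of all past sample gradients re-evaluated at the current iterate, and the sketch checks this numerically without applying H to any past gradient.

```python
import numpy as np


def igt_quadratic_demo(steps=50, dim=3, lr=0.1, seed=0):
    """Run an assumed IGT recursion on a toy quadratic with a shared Hessian.

    Sample gradients are g(theta, b) = H @ theta - b with random b.
    Returns the largest gap between the IGT estimate and the exact average
    of all sample gradients seen so far, evaluated at the current iterate.
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(dim, dim))
    H = A @ A.T / dim + np.eye(dim)  # shared symmetric positive definite Hessian

    theta = rng.normal(size=dim)
    theta_prev = theta.copy()
    v = np.zeros(dim)  # running gradient estimate
    seen_b = []        # stored only to compute the reference average
    max_gap = 0.0

    for t in range(steps):
        b = rng.normal(size=dim)
        seen_b.append(b)

        gamma = t / (t + 1)
        # Assumed transport step: evaluate the new sample's gradient at an
        # extrapolated point instead of correcting old gradients with H.
        shifted = theta + t * (theta - theta_prev)
        v = gamma * v + (1.0 - gamma) * (H @ shifted - b)

        # Reference: every past sample gradient recomputed at the current theta.
        exact = H @ theta - np.mean(seen_b, axis=0)
        max_gap = max(max_gap, float(np.abs(v - exact).max()))

        # Plain gradient step using the transported estimate.
        theta_prev, theta = theta, theta - lr * v

    return max_gap


if __name__ == "__main__":
    print("largest gap to exact averaged gradient:", igt_quadratic_demo())
```

The gap should stay at floating-point round-off level here because the sample gradients are affine in theta with a common Hessian. With differing Hessians the transported estimate is only approximate, which is consistent with the abstract restricting its optimal-rate claim to the equal-Hessian case.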
