Gradient flows and proximal splitting methods: a unified view on accelerated and stochastic optimization

Abstract

Optimization is at the heart of machine learning, statistics, and several applied scientific disciplines. Proximal algorithms form a class of methods that are broadly applicable and are particularly well-suited to nonsmooth, constrained, large-scale, and distributed optimization problems. There are essentially five proximal algorithms currently known, each proposed in seminal work: forward-backward splitting, Tseng splitting, Douglas-Rachford, alternating direction method of multipliers, and the more recent Davis-Yin. Such methods sit on a higher level of abstraction compared to gradient-based methods, having deep roots in nonlinear functional analysis. In this paper, we show that all of these algorithms can be derived as different discretizations of a single differential equation, namely the simple gradient flow, which dates back to Cauchy (1847). Many of the success stories in machine learning rely on "accelerating" the convergence of first-order methods. However, accelerated methods are notoriously difficult to analyze, counterintuitive, and lack an underlying guiding principle. We show that applying similar discretization schemes to Newton's classical equation of motion with an additional dissipative force, which we refer to as the accelerated gradient flow, yields accelerated variants of all these proximal algorithms, the majority of which are new, although some recover known cases in the literature. Moreover, we extend these algorithms to stochastic optimization settings, allowing us to make connections with Langevin and Fokker-Planck equations. Similar ideas apply to gradient descent, heavy ball, and Nesterov's method, which are simpler. These results thus provide a unified framework in which several optimization methods can be derived from basic physical systems.
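To make the discretization idea concrete, the following is a minimal sketch, not the paper's own derivation or code. It assumes a toy objective f(x) + g(x) with f(x) = 0.5*||x - b||^2 and g(x) = lam*||x||_1, chosen here purely for illustration. Forward-backward splitting can then be read as one explicit (forward) Euler step of the gradient flow on the smooth term f, followed by one implicit (backward) Euler step on the nonsmooth term g. The backward step is the proximal operator of g, which for the l1 norm is soft-thresholding. The names soft_threshold and forward_backward, and the step size h, are hypothetical.

```python
def soft_threshold(v, t):
    """Proximal operator of t*|.| at v: the implicit Euler step on g."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def forward_backward(b, lam, h=0.5, steps=200):
    """Forward-backward splitting for 0.5*||x - b||^2 + lam*||x||_1.

    h plays the role of the time step of the discretized gradient flow.
    This toy objective and all parameter names are illustrative assumptions.
    """
    x = [0.0] * len(b)
    for _ in range(steps):
        # Forward (explicit) Euler step on the smooth part: grad f(x) = x - b.
        y = [xi - h * (xi - bi) for xi, bi in zip(x, b)]
        # Backward (implicit) Euler step on the nonsmooth part: prox of h*g.
        x = [soft_threshold(yi, h * lam) for yi in y]
    return x


if __name__ == "__main__":
    print(forward_backward([3.0, -0.5, 1.2], lam=1.0))
```

For this separable objective the minimizer is known in closed form (soft-thresholding of b at level lam), so the iterates can be checked against it directly.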
