Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit

Abstract

Going beyond stochastic gradient descent (SGD), what new phenomena emerge in wide neural networks trained by adaptive optimizers like Adam? Here we show: The same dichotomy between feature learning and kernel behaviors (as in SGD) holds for general optimizers as well, including Adam -- albeit with a nonlinear notion of "kernel." We derive the corresponding "neural tangent" and "maximal update" limits for any architecture. Two foundational advances underlie the above results: 1) A new Tensor Program language, NEXORT, that can express how adaptive optimizers process gradients into updates. 2) The introduction of bra-ket notation to drastically simplify expressions and calculations in Tensor Programs. This work summarizes and generalizes all previous results in the Tensor Programs series of papers.
