Abstract
Going beyond stochastic gradient descent (SGD), what new phenomena emerge in wide neural networks trained by adaptive optimizers like Adam? Here we show: The same dichotomy between feature learning and kernel behaviors (as in SGD) holds for general optimizers as well, including Adam, albeit with a nonlinear notion of "kernel." We derive the corresponding "neural tangent" and "maximal update" limits for any architecture. Two foundational advances underlie the above results: 1) A new Tensor Program language, NEXORT, that can express how adaptive optimizers process gradients into updates. 2) The introduction of bra-ket notation to drastically simplify expressions and calculations in Tensor Programs. This work summarizes and generalizes all previous results in the Tensor Programs series of papers.