Optimizing Millions of Hyperparameters by Implicit Differentiation

Abstract

We propose an algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse Hessian approximations. We present results about the relationship between the IFT and differentiating through optimization, motivating our algorithm. We use the proposed approach to train modern network architectures with millions of weights and millions of hyperparameters. For example, we learn a data-augmentation network, where every weight is a hyperparameter tuned for validation performance, that outputs augmented training examples. Jointly tuning weights and hyperparameters with our approach is only a few times more costly in memory and compute than standard training.
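
To make the idea of an IFT-based hypergradient concrete, the following is a minimal sketch on a toy ridge-regression problem, not the implementation used in the paper. It rests on several assumptions that the abstract does not state: the inverse Hessian approximation is taken to be a truncated Neumann series, the single hyperparameter is an L2 penalty `lam`, and all function names are hypothetical. At the optimal weights w*(lam), the IFT gives dL_val/dlam = -(grad_w L_val)^T H^{-1} (d/dlam grad_w L_train), where H is the Hessian of the training loss with respect to the weights.

```python
import numpy as np


def neumann_inverse_hvp(hvp, v, alpha, num_terms):
    """Approximate H^{-1} v by alpha * sum_{j=0}^{num_terms} (I - alpha*H)^j v.

    Only Hessian-vector products are needed, never H^{-1} itself.
    The series converges when 0 < alpha < 2 / lambda_max(H) for positive definite H.
    """
    term = v.copy()
    total = v.copy()
    for _ in range(num_terms):
        term = term - alpha * hvp(term)  # (I - alpha*H) applied once more
        total = total + term
    return alpha * total


def ridge_solution(X, y, lam):
    """Exact minimizer of 0.5*||Xw - y||^2 + 0.5*lam*||w||^2 (toy inner problem)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def val_loss(w, X_val, y_val):
    """Validation loss 0.5*||X_val w - y_val||^2 (no direct dependence on lam)."""
    r = X_val @ w - y_val
    return 0.5 * r @ r


def ift_hypergradient(X, y, X_val, y_val, lam, num_terms=200):
    """Hypergradient dL_val/dlam via the IFT with a Neumann-series inverse."""
    w = ridge_solution(X, y, lam)
    H = X.T @ X + lam * np.eye(X.shape[1])   # training Hessian w.r.t. w
    alpha = 1.0 / np.linalg.norm(H, 2)       # step size from the spectral norm
    g_val = X_val.T @ (X_val @ w - y_val)    # grad_w of the validation loss
    v = neumann_inverse_hvp(lambda u: H @ u, g_val, alpha, num_terms)
    mixed = w                                # d/dlam of grad_w L_train equals w here
    return -v @ mixed
```

In this toy setting the Hessian is formed explicitly for clarity; in a large network one would presumably supply `hvp` through automatic differentiation instead, which is what keeps the cost to a small multiple of standard training.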
