Gradient Descent: The Ultimate Optimizer

  • 2019-09-29 21:41:49
  • Kartik Chandra, Erik Meijer, Samantha Andow, Emilio Arroyo-Fang, Irene Dea, Johann George, Melissa Grueter, Basil Hosmer, Steffi Stumpos, Alanna Tempest, Shannon Yang
  • 60

Abstract

Working with any gradient-based machine learning algorithm involves thetedious task of tuning the optimizer's hyperparameters, such as the learningrate. There exist many techniques for automated hyperparameter optimization,but they typically introduce even more hyperparameters to control thehyperparameter optimization process. We propose to instead learn thehyperparameters themselves by gradient descent, and furthermore to learn thehyper-hyperparameters by gradient descent as well, and so on ad infinitum. Asthese towers of gradient-based optimizers grow, they become significantly lesssensitive to the choice of top-level hyperparameters, hence decreasing theburden on the user to search for optimal values.

 

Quick Read (beta)

Abstract

Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer’s hyperparameters, such as the learning rate. There exist many techniques for automated hyperparameter optimization, but they typically introduce even more hyperparameters to control the hyperparameter optimization process. We propose to instead learn the hyperparameters themselves by gradient descent, and furthermore to learn the hyper-hyperparameters by gradient descent as well, and so on ad infinitum. As these towers of gradient-based optimizers grow, they become significantly less sensitive to the choice of top-level hyperparameters, hence decreasing the burden on the user to search for optimal values.

oddsidemargin has been altered.
marginparsep has been altered.
topmargin has been altered.
marginparwidth has been altered.
marginparpush has been altered.
paperheight has been altered.
The page layout violates the ICML style. Please do not change the page layout, or include packages like geometry, savetrees, or fullpage, which change it for you. We’re not able to reliably undo arbitrary changes to the style. Please remove the offending package(s), or layout-changing commands and try again.

 

Gradient Descent: The Ultimate Optimizer

 

Kartik Chandra0  Erik Meijer0 

Samantha Andow0  Emilio Arroyo-Fang0  Irene Dea0  Johann George0  Melissa Grueter0  Basil Hosmer0  Steffi Stumpos0  Alanna Tempest0  Shannon Yang0 


footnotetext: \forloop@affilnum1¡ 0Stanford University, Palo Alto, California, USA. Correspondence to: Kartik Chandra <[email protected]>, Erik Meijer <[email protected]>.  
Preprint under review.
Copyright 2019 by the author(s).
\@xsect

Usually we think of using gradient descent to optimize weights and other parameters of neural networks. Differentiable programming languages promise to make arbitrary code differentiable, allowing us to use gradient descent to optimize any program parameter that would otherwise be hard-coded by a human. Hence, there is no reason we should not be able to use gradient descent to optimize quantities other than the weights of a neural network, for instance hyperparameters like the gradient descent step size. But we don’t need to stop there, and we can just as well learn the hyper-hyperparameters used to optimize those hyperparameters, along with other constants occurring in gradient descent optimizers (DBLP:journals/corr/Ruder16).

Figure 1: The “hyperoptimization surface” described in Section id1. The thin solid traces are of vanilla SGD optimizers with a variety of choices for the hyperparameter α. The thick orange trace is our desired behavior, where the “hyperoptimizer” learns an optimal α over the course of the training, and thus outperforms the vanilla optimizer that begins at the same α.

In this paper we show that differentiable programming makes it practical to tune arbitrarily tall recursive towers of optimizers, where each optimizer adjusts the hyperparameters of its descendant:

  • Like DBLP:journals/corr/BaydinCRSW17, we independently rediscovered the idea of Almeida1999ParameterAI to implement efficient on-line hyperparameter optimizers by gradient descent. However, we generalize the approach of DBLP:journals/corr/BaydinCRSW17 in several dimensions.

  • In Section id1 we show how to craft the automatic differentiation (AD) computation graph such that the calculations to derive the hyperparameter update formula performed manually by DBLP:journals/corr/BaydinCRSW17 come “for free” as a result of reverse-mode automatic differentiation, just like the update rule for the weights does. This eliminates the need for certain tedious manual computations.

  • In Section id1 we utilize this newfound power to differentiate with respect to hyperparameters beyond just the learning rate, such as Adam’s β1,β2, and ϵ, and in Section id1 show empirically that learning these extra hyperparameters improves results.

  • Furthermore, in Section id1, we realize the vision of recursively stacking multiple levels of hyperparameter optimizers that was only hypothesized by DBLP:journals/corr/BaydinCRSW17 Hyperparameter optimizers can themselves be optimized, as can their optimizers, and so on ad infinitum. We demonstrate empirically in Section id1 that such towers of optimizers are scalable to many recursive levels.

  • Section id1 shows that taller stacks of hyperoptimizers are indeed significantly less sensitive to the choice of top-level hyperparameters. This reduces the burden on humans responsible for tuning the hyperparameters — rather than “seeking needles in a haystack,” we can instead simply “shoot fish in a barrel.”

\@xsect

What does it mean to optimize an optimizer? Consider Figure 1, which depicts a “hyperoptimization surface” for using stochastic gradient descent (SGD) to optimize some loss function f. Each thin trace is a loss curve of SGD with the given step size hyperparameter α. These loss curves form cross-sections of the surface along the α axis, parallel to the batch-number/loss plane.

Notice that with α<10-1 SGD performs poorly because the parameters update too slowly to make meaningful progress by the end of the training period. Similarly, for α>101 SGD also performs poorly because the parameter updates are too noisy to converge to the optimum. The optimal hyperparameter is therefore somewhere between 10-1 and 101. If we were training this model, we would have to manually discover this range by experimentation.

Instead, imagine if we could utilize a variant of SGD to climb down this surface no matter where we started, as demonstrated by the thick orange trace. Unlike the thin traces of vanilla SGD, the thick orange trace is not confined to a single plane — though it begins at the highly sub-optimal α=10-4, it gradually “learns” to increase α, and attains a final loss function on par with the optimal hyperparameter for vanilla SGD. In Section id1 we will describe how to achieve this by adjusting α at each step of SGD.

Note that our approach is not limited to tuning just step sizes. The Adam optimizer, for example, already intelligently adjusts the step size for each parameter based on past progress (Adam). However, Adam still has its own fixed hyperparameters: the learning rate α, the two moment coefficients β1,β2, and the factor ϵ used to avoid division by zero. For instance, the recommended default for ϵ is often quoted as 10-8, but as the TensorFlow documentation remarks, sometimes it is better to use 1.0 or 0.1 instead. As we show in Section id1, we can tune these additional hyperparameters automatically.

Existing work (Maclaurin:2015:GHO:3045118.3045343; Pedregosa:2016:HOA:3045390.3045469; pmlr-v70-franceschi17a), attempts to learn a single optimal hyperparameter for the entire training history — by gradient descent on the dashed black “U” in Figure 1. This is inefficient because it requires memory to store the entire unrolled run. Our work, along with DBLP:journals/corr/BaydinCRSW17 and Almeida1999ParameterAI, uses a stochastic variant of the above instead: perform incremental updates to the hyperparameter in parallel with the learning. Since each incremental update only depends on its immediate history, we can “forget” all but a constant amount of information of the unrolled run that non-stochastic approaches have to “remember” and fully differentiate through.

\@xsect

Consider some stochastic loss function f that we want to minimize using gradient descent, and let wi be the weights at the beginning of step i. Let us first recall the standard weight update rule at step i for SGD, using some (fixed) step size α:

wi+1=wi-αf(wi)wi

We would like to update α as well at each step, so we will index it with the step number also: let αi be the step size at the beginning of step i. At each step, we will first update the step size to αi+1 using some update rule yet to be derived, and then use the updated step size αi+1 to update the weights from wi to wi+1.

αi+1 =αi-adjustment for αi
wi+1 =wi-αi+1f(wi)wi

What should the adjustment for αi be? By analogy to w, we want to adjust αi in the direction of the gradient of the loss function with respect to αi, scaled by some hyper-step size κ. In other words, the adjustment should be κ(f(wi)/αi). Section id1 addresses the practical matter of the selection of this hyper-hyperparameter — for now, we will take κ as a given fixed constant. Our modified update rule is therefore:

αi+1 =αi-κf(wi)αi (1)
wi+1 =wi-αi+1f(wi)wi (2)

All that remains is to compute f(wi)/αi in equation (1).

In the next section, we review how DBLP:journals/corr/BaydinCRSW17 compute this derivative by hand, obtaining an elegant and efficiently-computable expression. In the section that follows, we show how we can compute the partial derivative for the step size update completely automatically, exactly like the partial derivative for the weights. This makes it possible to generalize our approach in many different ways.

\@xsect

One option to compute f(wi)/αi, explored by DBLP:journals/corr/BaydinCRSW17, is to proceed by direct manual computation of the partial derivative. Applying the chain rule to the derivative in question, we can compute

f(wi)αi =f(wi)wiwiαi (3)
=f(wi)wi(wi-1-αif(wi-1)wi-1)αi (4)
=f(wi)wi(-f(wi-1)wi-1) (5)

where (4) is obtained by substituting the update rule in (2) for wi and (5) is obtained by observing that wi-1 does not depend on αi, and can therefore be treated as a constant. In this particular case, the resulting expression is simple and elegant: the dot product of the preceding two gradients with respect to the weights, which as we see from equation  (2) would have already been computed in order to update the weights themselves. We only need to remember the previous derivate, so this is very memory efficient and time efficient.

The direct-computation strategy works well when the update rule is easy to differentiate by hand with respect to the hyperparameter — in SGD as above, it is simply a multiplication by a constant, whose derivative is trivial. However, this is not always the case. Consider, for example, the update rule for the Adam optimizer, as in Algorithm 1 of Adam, which has a much more complicated dependence on the hyperparameters β1 and β2. Differentiating the update rule by hand, we obtain the following results, caveat emptor:

wtαt =-m^t(ϵt+v^t)
wtβ1t =-αt(-f(wt-1)wt-1+mt-1+tβ1t(t-1)m^t)(1-β1tt)(ϵt+v^t)
wtβ2t =αtm^tv^t(-(f(wt-1)wt-1)2+vt-1+tβ2t(t-1)v^t)2vt(ϵt+v^t)2
wtϵt =αtm^t(ϵt+v^t)2

We see how again the derivatives of the loss function with respect to the hyperparameters wt/β1t and wt/β2t are defined in terms of the previous value of the the derivative for the actual parameters f(wt-1)/wt-1, but embedded within a much more complex expression than before. Clearly, this manual approach to compute the hyperparameter update rules does not scale. However, with a little bit of care we can actually compute these derivatives automatically by backwards AD, just like we do for regular parameters.

\@xsect

In order to compute f(wi)/αi automatically, let us first briefly review the operational mechanics of reverse-mode automatic differentiation. Frameworks that provide reverse-mode AD (Griewank) to compute f(wi)/αi do so by building up a backwards computation graph as the function is computed forwardly. For example, when a user computes the loss function f(wi), the framework internally stores a DAG whose leaves are the weights wi, whose internal nodes are intermediate computations (for a DNN the outputs of each successive layer), and whose root is the concrete loss function such as LogSoftmax. The framework can then backpropagate from the backwards computation graph created for this root node, depositing gradients in each node as it descends, until the weights wi at the leaf nodes have accumulated the gradient f(wi)/wi.

Once the gradient f(wi)/wi is computed by the backwards pass, we can then continue to update the weights wi+1=wi-α/f(wi)wi as shown above, and repeat the cycle for the next training batch. However, an important consideration is for the weights to be “detached” from the computation graph before each iteration of this algorithm — that is, for the weights to be forcibly converted to leaves of the graph by removing any inbound edges. The effect of the “detach” operation is depicted in Figure 2. If this step is skipped, the next backpropagation iteration will continue beyond the current weights into the past. This is problematic in a couple of ways, depending on how the weight update is implemented. If the weight update is implemented as an in-place operation, then this will yield incorrect results as more and more gradients get accumulated onto the same node of the computation graph. If the weight update is implemented by creating a fresh node, then over time the computation graph will grow taller linearly in the number of steps taken; because backpropagation is linear in the size of the graph, the overall training would become quadratic-time and intractable.

(a) Computation graph of SGD with a fixed hyperparameter α.
(b) Computation graph of SGD with a continuously-updated hyperparameter αi.
Figure 2: Comparing the computation graphs of vanilla SGD and HyperSGD.

Let’s peek inside the implementation of SGD in PyTorch (PyTorch) as of commit bb41e6 to see how this cutting-off is implemented in the actual source code:

  # line 91 of sgd.py
  d_p = p.grad.data
  #  momentum calculations omitted
  # line 106
  p.data.add_(-group[’lr’], d_p)

Here, p represents the parameter being optimized (i.e. wi) and lr is the learning rate (i.e. α). A few more PyTorch-specific clarifications: p.grad retrieves the gradient of the loss function with respect to p. The call to .add_ updates p in-place with the product of the arguments, that is, with -αf(w)/w. Most importantly, the highlighted calls to .data implement the “detachment” by referring to the datastore of that variable directly, ignoring the associated computation graph information. Note that in vanilla PyTorch the step size is not learned, so no call to .data is required for the learning rate because it is internally stored as a raw Python float rather than a differentiable variable tracked by PyTorch.

For the sake of consistency let us rewrite this function, renaming variables to match the above discussion and promoting alpha to a differentiable variable in the form of a rank-0 tensor. In order to keep the computation graph clear, we will also update weights by creating fresh nodes rather than to change them in-place.

def SGD.__init__(self, alpha):
  self.alpha = tensor(alpha)
def SGD.adjust(w):
  d_w = w.grad.detach()
  w = w.detach() - self.alpha.detach() * d_w

The highlighted calls to .detach() correspond to detaching the weights and their gradients. Now, in order to have backpropagation deposit the gradient with respect to αi as well as wi, we can simply refrain from detaching αi from the graph, detaching instead its parents. This is depicted in Figure 2.

Notice in particular that because we want to compute f(wi)/αi the edge from αi to wi needs to remain intact. To implement this, instead of calling .detach() on alpha directly, we instead call .detach() on its parents when adjusting it using equation (1). This change yields the following fully-automated hyperoptimization algorithm11 1 The example code in this section elides some small PyTorch-related details. Appendix id1 contains the full PyTorch source code for all examples and experiments in this paper.: def HyperSGD.adjust(w):   # update alpha using Equation (1)   d_alpha = self.alpha.grad.detach()   self.alpha = self.alpha.detach() -           kappa.detach() * d_alpha   # update w using Equation (2)   d_w = w.grad.detach()   w = w.detach() - self.alpha.detach() * d_w Notice that because we are only extending the computation graph by a little extra amount (corresponding to evaluating the optimizer), the backwards AD pass should not be significantly more computationally expensive. Section id1 presents an empirical evaluation of the computational cost to hyperoptimization.

\@xsect

As suggested in previous work (Maclaurin:2015:GHO:3045118.3045343), it should be possible to apply gradient-based methods for tuning hyperparameters of common variations on SGD such as AdaGrad (AdaGrad), AdaDelta (AdaDelta), or Adam (Adam). The above implementation of HyperSGD generalizes quite easily to these optimizers. In this section we demonstrate the HyperAdam optimizer, which mostly follows by analogy to HyperSGD. Unlike previous work, which could only optimize Adam’s learning rate (DBLP:journals/corr/BaydinCRSW17), we are able to optimize all four hyperparameters of Adam automatically. Our evaluation in Section id1 demonstrates that this indeed useful to do.

There are, however, two important subtleties to be aware of.

First, because the hyperparameters β1 and β2 must be strictly in the domain (0,1), we clamp the “raw” values to this domain using a scaled sigmoid. Without this step, we might accidentally adjust these values outside their domains, which ultimately leads to arithmetic exceptions.

Second, the Adam optimizer involves the term v^t, which is continuous but not differentiable at v^t=0. Because Adam normally initializes v0=0, backpropagation would fail on the very first step due to a division by zero error. We fix this problem by initializing v0 to ϵ rather than 0.

These two subtleties reveal a limitation of our automatic approach to hyperparameter optimization: while the domain restrictions are evident in the explicit formulas presented above (notice, for example, the vt in the denominator of the expression for wt/β2t derived above, which would immediately to signal a user the potential for division-by-zero), they are more difficult to predict and debug if the derivatives are taken automatically. The hyperparameter update rule does not “know” the domains of the hyperparameters, and so it might step too far and lead to a mysterious crash or nan issue. In practice however, this has not been a showstopper.

Implementing these fixes, and remembering to .detach() the Adam intermediate state (i.e. mt-1 and vt-1) in the right place to prevent “leaks” in the backwards AD, we obtain the following implementation:

def HyperAdam.adjust(w):
  # update Adam hyperparameters by SGD
  d_alpha = self.alpha.grad.detach()
  self.alpha = self.alpha.detach() -
          kappa.detach() * d_alpha
  d_beta1 = self.beta1.grad.detach()
  self.beta1 = self.beta1.detach() -
          kappa.detach() * d_beta1
  d_beta2 = self.beta2.grad.detach()
  self.beta2 = self.beta1.detach() -
          kappa.detach() * d_beta2
  d_eps   = self.eps.grad.detach()
  self.eps = self.eps.detach() -
          kappa.detach() * d_eps
  # clamp coefficients to domain (0, 1)
  beta1_clamp = (tanh(self.beta1) + 1)/2
  beta2_clamp = (tanh(self.beta2) + 1)/2
  # update w using Adam update rule
  self.t += 1
  g = w.grad.detach()
  self.m =
    beta1_clamp * self.m.detach() +
      (1 - self.beta1) * g
  self.v =
    beta2_clamp * self.v.detach() +
      (1 - self.beta2) * g * g
  mhat = self.m / (1 - beta1_clamp**t)
  vhat = self.v / (1 - beta2_clamp**t)
  w = w.detach() -
    self.alpha * mhat /
        (vhat ** 0.5 + self.eps)
\@xsect

At this point it is natural to ask whether the hyperoptimizer can itself be optimized; that is, whether the human-selected hyper-hyperparameter κ to update the hyperparameters (e.g. α) can be adjusted by a hyper-hyperoptimizer. The possibility of doing so recursively ad infinitum to obtain an optimization algorithm that is highly robust to the top-level human-chosen hypernparameter was hypothesized in Section 5.2 of DBLP:journals/corr/BaydinCRSW17. Computing the gradients of these higher-order hyperparameters by hand is impossible without knowing the exact sequence of stacked optimizers ahead of time, and as we have shown above, will be extremely tedious and error prone.

However, the ability to compute these gradients automatically by backwards AD makes it possible to realize this vision. To do so, let us revisit our previous implementation of HyperSGD. Notice that there is an opportunity for recursion lurking here: the adjustment to alpha can be factored out with a call to SGD.adjust, where SGD’s hyperparameter is kappa.

def HyperSGD.adjust(w):
  SGD(kappa).adjust(self.alpha)
  d_w = w.grad.detach()
  w = w.detach() - self.alpha * d_w

Because SGD is already careful to properly detach its parameter (typically w, but in this case α), this implementation is functionally identical to the one above. Indeed, any optimizer that observes this protocol would suffice, so let us abstract out the optimizer as a parameter to HyperSGD:

def HyperSGD.__init__(self, alpha, opt):
  self.alpha = tensor(alpha)
  self.optimizer = opt
def HyperSGD.adjust(w):
  self.optimizer.adjust(self.alpha)
  d_w = w.grad.detach()
  w = w.detach() - self.alpha * d_w
opt = HyperSGD(0.01, opt=SGD(kappa))

After this refactoring, finally, we can recursively feed HyperSGD itself as the optimizer, obtaining a level-2 hyperoptimizer HyperSGD(0.01, HyperSGD(0.01, SGD(0.01))). Similarly, we can imagine taller towers, or towers that mix and match multiple different kinds of optimizers, such as Adam-optimized-by-SGD-optimized-by-Adam.

A natural application of this idea is to automatically learn hyperparameters on a per-parameter basis. For example, when hyperoptimizing Adam with SGD as in Section id1, it is extremely beneficial to maintain a separate hyper-step size (i.e. a separate κ) for each of the four hyperparameters, since they typically span many orders of magnitude. Instead of specifying each κ as a separate top-level hyperparameter, however, we can instead apply a second level of SGD that lets the system automatically learn optimal hyper-step sizes for each Adam hyperparameter separately.

A logical concern is whether this process actually exacerbates the hyperparameter optimization problem by introducing even more hyperparameters. DBLP:journals/corr/BaydinCRSW17 predicted that as the towers of hyperoptimizers grow taller, the resulting algorithms would be less sensitive to the human-chosen hyperparameters, and therefore the overall burden on the user will be reduced. This indeed seems to be the case; Section id1 presents an empirical evaluation of this hypothesis.

\@xsect
Optimizer Test acc Time
SGD(0.01) 77.48% 16ms
SGD(0.01) / SGD(0.01) 88.35% 16ms
SGD(0.145) 88.81% 16ms
SGD(0.01) / Adam(…) 86.80% 23ms
SGD(0.096) 87.71% 16ms
Table 1: Hyperoptimizing SGD. The symbol Adam(…) refers Adam with the standard hyperparameters. Each hyperoptimizer experiment is repeated using the final hyperparameters learned by the algorithm.
Optimizer Test acc Time
Adam(…) — baseline 91.09% 38ms
Adam(…) / SGD(10-3) / SGD(10-4) 92.74% 43ms
Adam( 0.0291, 0.8995, 0.999, -8) 93.74% 40ms
Adamα(…) / SGD(10-3) / SGD(10-4) 92.64% 37ms
Adamα( 0.0284, *) 93.42% 41ms
Adam(…) / Adam(…) 94.35% 41ms
Adam( 0.013, 0.892, 0.998, -8) 94.75% 36ms
Adamα(…) / Adam(…) 94.01% 31ms
Adamα( 0.013, …) 94.39% 32ms
Table 2: Hyperoptimizing Adam. The symbol Adam(…) retains its meaning from Table 1.

In this section we evaluate the hyperoptimizers made possible by our system, exploring in particular the benefits of being able to optimize hyperparameters beyond just step size and of higher-order optimization, as well as whether or not there is a significant computational cost to automatically computing derivatives with respect to hyperparameters.

\@xsect

Like authors of previous work Maclaurin:2015:GHO:3045118.3045343; DBLP:journals/corr/BaydinCRSW17, we conducted all of our experiments were conducted on the MNIST dataset (MNIST) using a neural network with one fully-connected hidden layer of size 128, 𝑡𝑎𝑛ℎ activations, and a batch size of 300 run for a single epoch. We implemented the system in PyTorch and ran experiments on a 2.4 GHz Intel CPU with 32GB of memory. The full source code for each of these experiments is presented in Appendix id1.

\@xsect

We denote species of hyperoptimizers by their sequence of constituent optimizers with their initial hyperparameters. The leftmost item adjusts the parameters of the model whereas the rightmost item has fixed hyperparameters. For example, the term “SGD(0) / Adam(0.001, 0.9, 0.999, -8)” indicates that the weights of the neural network were adjusted by stochastic gradient descent with a step size that, while initially 0, was adjusted by a regular Adam optimizer with hyperparameters α=0.001,β1=0.9,β2=0.999,ϵ=10-8. Adamα denotes an Adam optimizer where only α is optimized as in DBLP:journals/corr/BaydinCRSW17, and the abbreviations Adam(…) and Adamα(…) denote the respective optimizers with the standard hyperparameters (α=0.001,β1=0.9,β2=0.999,ϵ=10-8), which are recommended by Adam and are used by default almost universally across software packages.

\@xsect

Here we seek to answer two questions: (1) whether an SGD hyperoptimizer performs better than an elementary SGD optimizer22 2 Following previous work (Maclaurin:2015:GHO:3045118.3045343; DBLP:journals/corr/BaydinCRSW17), we refer to standard, vanilla “non-hyperoptimized” optimizers as “elementary optimizers.”, and (2) whether or not the learned step size outperforms the initial step size. We test the latter property by running a fresh elementary SGD optimizer with the final learned step size of the hyperoptimizer.

Table 1 summarizes the results of our experiments, run with an initial step size of 0.01 (see Section id1 for a discussion of the sensitivity of these results to this initial step size). We find that hyperoptimized SGD outperforms the baseline by a significant margin (nearly 10%). This holds even if we use an Adam optimizer to adjust the step size of the SGD optimizer.

Furthermore, when we re-ran the elementary optimizers with the new learned hyperparameters, we found that they typically performed incrementally better than the hyperparameter itself. This is what Luketina:2016:SGT:3045390.3045701 refer to as the “hysteresis” effect of hyperparameter optimization: we cannot reap the benefits of the optimized hyperparameters while they are themselves still in the early stages of being optimized — thus, hyperoptimizers should “lag” slightly behind elementary optimizers that start off with the final optimized hyperparameters.

\@xsect

In Section id1, we described how to apply our system to optimizing the Adam optimizer, which maintains first- and second-order momentum information for each parameter it optimizes. Altogether, there are four hyperparameters: a learning rate (α), coefficients for the first- and second-order momenta (β1,β2), and an epsilon value (ϵ). We tune all four simultaneously, first using SGD and then by using Adam itself on the top-level. Our Adam / SGD experiments utilize the higher-order design proposed at the end of Section id1 to learn a separate hyper-step size for each hyperparameter of Adam; this is not needed for Adam / Adam because Adam by design already maintains separate information for each parameter it optimizes.

We seek to answer three questions: (1) whether hyperoptimized Adam optimizers perform better than elementary Adam optimizers, (2) whether the learned hyperparameters outperform the baseline, and (3) whether there is a benefit to optimizing all four hyperparameters, as opposed to only optimizing the learning rate as DBLP:journals/corr/BaydinCRSW17 do.

Table 2 summarizes the results of our experiments. We find that indeed the hyperoptimized Adam optimizer outperforms the elementary Adam optimizer on its “default” settings. As with SGD in Section id1, the learned hyperparameters perform incrementally better than the hyperoptimizer due to the hysteresis effect.

Inspecting the learned hyperparameters, we find that the algorithm significantly raises the learning rate α and slightly lowers β1, but does not significantly affect either β2 or ϵ. Nevertheless, learning β1 does provide a noticeable benefit: our hyperoptimized Adam outperforms hyperoptimized Adamα, which can only learn α. Both hyperoptimizers learn similar optimized values for α, but Adamα cannot also adapt β1, and therefore does not perform as well.

\@xsect
Figure 3: As we stack more and more layers of SGD, the resulting hyperoptimizer is less sensitive to the initial choice of hyperparameters.

In Section id1 we developed an interface for building arbitrarily tall towers of optimizers. Recall that DBLP:journals/corr/BaydinCRSW17 hypothesized that taller towers would yield hyperoptimizers that were more robust to the top-level human-chosen hyperparameters than elementary optimizers are.

To validate this behavior of higher-order hyperoptimizers, we ran the above benchmark with towers of SGD-based hyperoptimizers of increasing heights, where each layer of SGD started with the same initial step size α0. Figure 3 shows the results of this experiment. It is indeed the case that the taller the hyperoptimizer stack, the less sensitive the results become for the top-level hypern-parameters; roughly one order of magnitude per step until after 5-6 levels the graphs converge.

Notice also that the sensitivity only decreases for smaller initial step sizes; all hyperoptimizers performed poorly beyond α0>102. We hypothesize that it is difficult to recover from a too-high initial step size because dramatic changes in parameters at each step make the stochastic loss function too noisy. In comparison, if the hyperoptimizer’s initial step size is too low, then the weights do not change very dramatically at each step, and as a result a “signal” to increase the step size can be extracted from a series of stochastic gradient descent steps.

\@xsect
(a) Higher-order hyperoptimization performance with SGD.
(b) Higher-order hyperoptimization performance with Adam.
Figure 4: As the stacks of hyperoptimizers grow taller, each step of SGD takes longer by a small constant factor, corresponding to the extra step of stepping one node further in the backwards AD computation graph.

When we stack a new hyperparameter optimizer, we are effectively adding a new layer to the computation graph. This corresponds to extending each step of Figure 2 further to the left by yet another node. Thus, with a hyperoptimizer stack of height n we would obtain a computation graph of size O(n) at each step, which means backpropagation takes time O(n). We should therefore expect training with a stack of n hyperoptimizers to take time O(n).

To test this hypothesis, we extended the benchmark from Section id1 to much taller stacks, up to height 50. Note that these experiments are meant to stress-test the system with significantly taller stacks than would typically be necessary (recall from Section id1 that the stacks of hyperoptimizers need not be taller than 3-4 to be highly effective).

As shown in Figure 4, higher-order hyperoptimization is indeed asymptotically linear-time in the height of the optimizer stack. Note how the slope of this linear relationship is quite small compared to the fixed computational cost of backpropagating through the loss function. This makes sense: the additional work at each level is only the computational cost of backpropagating through the new top-level optimizer, which is typically much simpler than the machine learning model itself. Indeed, the difference in slopes between higher-order SGD in Figure 4 and higher-order Adam in Figure 4 is simply because Adam is a more complex optimizer, requiring more computation to differentiate through.

In summary, we find that in practice higher-order hyperoptimization is an extremely lightweight addition to any machine learning model with great benefits.

\@xsect

Hyperparameter optimization has a long history, and we refer readers interested in the full story to a recent survey by Feurer2019.

Most existing work on gradient-based hyperparameter optimization (BengioHyperopt; domke; Maclaurin:2015:GHO:3045118.3045343; Pedregosa:2016:HOA:3045390.3045469; pmlr-v70-franceschi17a) has focused on computing hyperparameter gradients after several iterations of training, which is computationally expensive because of the need to backpropagate through much more computation. DBLP:journals/corr/BaydinCRSW17, building on a technique first published by Almeida1999ParameterAI, propose instead updating hyperparameters at each step. Luketina:2016:SGT:3045390.3045701 apply a similar technique to regularization hyperparameters, though they explicitly note that their proposed method could work in principle for any continuous hyperparameter.

As discussed above, we expand upon this latter line of work in three directions: (1) by optimizing hyperparameters beyond just the learning rate; (2) by fully automating this process, rather than requiring manual derivative computations; and (3) by realizing the vision of recursively constructing higher-order hyperoptimizers and evaluating the resulting algorithms.

\@xsect\@xsect

Like DBLP:journals/corr/BaydinCRSW17, we found that our hyperparameters converge extremely quickly. Further investigation is required to understand the dynamics of the higher-order hyperparameters. If there is indeed a compelling theoretical reason for this rapid convergence, it would suggest a form of higher-order “early stopping” where the hyperoptimizer monitors its hyperparameters’ convergence, and at some point decides to freeze its hyperparameters for the remainder of training. Besides the obvious performance improvement, this may allow the system to leverage the existing implicit regularization behavior exhibited by “vanilla” SGD.

\@xsect

While existing work on gradient-based hyperparameter optimization has primarily been evaluated in small-scale settings such as MNIST, automated hyperparameter tuning is particularly important in large-scale settings where training is computationally expensive, limiting the amount of manual hyperparameter tuning that can be done. Nonetheless, the choice of hyperparameters is still crucial: for example, a recent study improved significantly upon the state of the art in an NLP task simply by (manually) adjusting hyperparameters; indeed, they found that the performance was highly sensitive to Adam’s ϵ and β1 hyperparameters (RoBERTa). A natural next step, therefore, is investigating the effectiveness of higher-order hyperoptimization in automatically reproducing such results.

\@xsect

The hyper-regularizer of Luketina:2016:SGT:3045390.3045701 could be combined with the recursive “higher-order” approach described in this paper in order to derive highly robust regularizers. We note that there is a clear connection between hyper-regularizers and hyperpriors in Bayesian inference; we leave further study of this connection to future work.

\@xsect

In this paper, we presented a technique to enhance optimizers such as SGD and Adam by allowing them to tune their own hyperparameters by gradient descent. Unlike existing work, our proposed hyperoptimizers learn hyperparameters beyond just learning rates, require no manual differentiation by the user, and can be stacked recursively to many levels. We described in detail how to implement hyperoptimizers in a reverse-mode AD system. Finally, we demonstrated empirically three benefits of hyperoptimizers: that they outperform elementary optimizers, that they are less sensitive to human-chosen hyperparameters than elementary optimizers, and that they are highly scalable.

References

\@xsect

Below is the full source code for the implementation described in Section id1 and all the experiments reported in Section id1.

# hyperopt.py
import math
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
class Optimizable:
    ”’
    This is the interface for anything that has parameters that need to be
    optimized, somewhat like torch.nn.Model but with the right plumbing for
    hyperoptimizability. (Specifically, torch.nn.Model uses the Parameter
    interface which does not give us enough control about the detachments.)
    Nominal operation of an Optimizable at the lowest level is as follows:
        o = MyOptimizable(…)
        o.initialize()
        loop {
            o.begin()
            o.zero_grad()
            loss = compute loss function from parameters
            loss.backward()
            o.adjust()
        }
    Optimizables recursively handle updates to their optimiz*ers*.
    ”’
    def __init__(self, parameters, optimizer):
        self.parameters = parameters # a dict mapping names to tensors
        self.optimizer = optimizer   # which must itself be Optimizable!
        self.all_params_with_gradients = []
    def initialize(self):
        ”’ Initialize parameters, e.g. with a Kaiming initializer. ”’
        pass
    def begin(self):
        ”’ Enable gradient tracking on current parameters. ”’
        for name, param in self.parameters.items():
            param.requires_grad_() # keep gradient information
            param.retain_grad()    # even if not a leaf
            self.all_params_with_gradients.append(param)
        self.optimizer.begin()
    def zero_grad(self):
        ”’ Set all gradients to zero. ”’
        for param in self.all_params_with_gradients:
            param.grad = torch.zeros(param.shape)
        self.optimizer.zero_grad()
    ”’ Note: at this point you would probably call .backwards() on the loss
    function. ”’
    def adjust(self):
        ”’ Update parameters ”’
        pass
class MNIST_FullyConnected(Optimizable):
    ”’
    A fully-connected NN for the MNIST task. This is Optimizable but not itself
    an optimizer.
    ”’
    def __init__(self, num_inp, num_hid, num_out, optimizer):
        parameters = {
            w1’: torch.zeros(num_inp, num_hid).t(),
            b1’: torch.zeros(num_hid).t(),
            w2’: torch.zeros(num_hid, num_out).t(),
            b2’: torch.zeros(num_out).t()
        }
        super().__init__(parameters, optimizer)
    def initialize(self):
        nn.init.kaiming_uniform_(self.parameters[’w1’], a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.parameters[’w2’], a=math.sqrt(5))
        self.optimizer.initialize()
    def forward(self, x):
        ””” Compute a prediction. ”””
        x = F.linear(x, self.parameters[’w1’], self.parameters[’b1’])
        x = torch.tanh(x)
        x = F.linear(x, self.parameters[’w2’], self.parameters[’b2’])
        x = torch.tanh(x)
        x = F.log_softmax(x, dim=1)
        return x
    def adjust(self):
        self.optimizer.adjust(self.parameters)
    def __str__(self):
        return mnist /  + str(self.optimizer)
class NoOpOptimizer(Optimizable):
    ”’
    NoOpOptimizer sits on top of a stack, and does not affect what lies below.
    ”’
    def __init__(self):
        pass
    def initialize(self):
        pass
    def begin(self):
        pass
    def zero_grad(self):
        pass
    def adjust(self, params):
        pass
    def __str__(self):
        return static
class SGD(Optimizable):
    ”’
    A hyperoptimizable SGD
    ”’
    def __init__(self, alpha=0.01, optimizer=NoOpOptimizer()):
        parameters = {’alpha’: torch.tensor(alpha)}
        super().__init__(parameters, optimizer)
    def adjust(self, params):
        self.optimizer.adjust(self.parameters)
        for name, param in params.items():
            g = param.grad.detach()
            params[name] = param.detach() - g * self.parameters[’alpha’]
    def __str__(self):
        return sgd(%f) / ’%self.parameters[’alpha’] + str(self.optimizer)
class SGDPerParam(Optimizable):
    ”’
    Like above, but can be taught a separate step size for each parameter it
    tunes.
    ”’
    def __init__(self, alpha=0.01, params=[], optimizer=NoOpOptimizer()):
        parameters = {name + _alpha’: torch.tensor(alpha) for name in params}
        super().__init__(parameters, optimizer)
    def adjust(self, params):
        self.optimizer.adjust(self.parameters)
        for name, param in params.items():
            g = param.grad.detach()
            params[name] = param.detach() - g * self.parameters[name+’_alpha’]
    def __str__(self):
        return sgd(%s) /  %\
            str({k:t.item() for k,t in self.parameters.items()}) +\
            str(self.optimizer)
class Adam(Optimizable):
    ”’
    A fully hyperoptimizable Adam optimizer
    ”’
    def clamp(x):
        return (x.tanh() + 1.) / 2.
    def unclamp(y):
        z = y * 2. - 1.
        return ((1. + z) / (1. - z)).log() / 2.
    def __init__(
        self,
        alpha=0.001, beta1=0.9, beta2=0.999, log_eps=-8.,
        optimizer=NoOpOptimizer()
    ):
        parameters = {
            alpha’: torch.tensor(alpha),
            beta1’: Adam.unclamp(torch.tensor(beta1)),
            beta2’: Adam.unclamp(torch.tensor(beta2)),
            log_eps’: torch.tensor(log_eps)
        }
        super().__init__(parameters, optimizer)
        self.num_adjustments = 0
        self.cache = {}
    def adjust(self, params):
        self.num_adjustments += 1
        self.optimizer.adjust(self.parameters)
        t = self.num_adjustments
        beta1 = Adam.clamp(self.parameters[’beta1’])
        beta2 = Adam.clamp(self.parameters[’beta2’])
        for name, param in params.items():
            if name not in self.cache:
                self.cache[name] = {
                    m’: torch.zeros(param.shape),
                    v’: torch.zeros(param.shape) +\
                            10.**self.parameters[’log_eps’].data
# NOTE that we add a little fudge factor here because sqrt is not
# differentiable at exactly zero
                }
            g = param.grad.detach()
            self.cache[name][’m’] = m =\
                beta1 * self.cache[name][’m’].detach() + (1. - beta1) * g
            self.cache[name][’v’] = v =\
                beta2 * self.cache[name][’v’].detach() + (1. - beta2) * g * g
            self.all_params_with_gradients.append(m)
            self.all_params_with_gradients.append(v)
            m_hat = m / (1. - beta1 ** float(t))
            v_hat = v / (1. - beta2 ** float(t))
            dparam = m_hat / (v_hat ** 0.5 + 10. ** self.parameters[’log_eps’])
            params[name] = param.detach() - self.parameters[’alpha’] * dparam
    def __str__(self):
        return adam(’ + str(self.parameters) + ’) /  + str(self.optimizer)
class AdamBaydin(Optimizable):
    ”’ Same as above, but only optimizes the learning rate, treating the
    remaining hyperparameters as constants. ”’
    def __init__(
        self,
        alpha=0.001, beta1=0.9, beta2=0.999, log_eps=-8.,
        optimizer=NoOpOptimizer()
    ):
        parameters = {
            alpha’: torch.tensor(alpha),
        }
        self.beta1 = beta1
        self.beta2 = beta2
        self.log_eps = log_eps
        super().__init__(parameters, optimizer)
        self.num_adjustments = 0
        self.cache = {}
    def adjust(self, params):
        self.num_adjustments += 1
        self.optimizer.adjust(self.parameters)
        t = self.num_adjustments
        beta1 = self.beta1
        beta2 = self.beta2
        for name, param in params.items():
            if name not in self.cache:
                self.cache[name] = {
                    m’: torch.zeros(param.shape),
                    v’: torch.zeros(param.shape) + 10.**self.log_eps
                }
            g = param.grad.detach()
            self.cache[name][’m’] = m =\
                beta1 * self.cache[name][’m’].detach() + (1. - beta1) * g
            self.cache[name][’v’] = v =\
                beta2 * self.cache[name][’v’].detach() + (1. - beta2) * g * g
            self.all_params_with_gradients.append(m)
            self.all_params_with_gradients.append(v)
            m_hat = m / (1. - beta1 ** float(t))
            v_hat = v / (1. - beta2 ** float(t))
            dparam = m_hat / (v_hat ** 0.5 + 10. ** self.log_eps)
            params[name] = param.detach() - self.parameters[’alpha’] * dparam
    def __str__(self):
        return adam(’ + str(self.parameters) + ’) /  + str(self.optimizer)
# main.py
import numpy as np
import json, math, time
from hyperopt import *
import gc
BATCH_SIZE = 300
mnist_train = torchvision.datasets.MNIST(
    ’./data’,
    train=True,
    download=True,
    transform=torchvision.transforms.ToTensor()
)
mnist_test = torchvision.datasets.MNIST(
    ’./data’,
    train=False,
    download=True,
    transform=torchvision.transforms.ToTensor()
)
dl_train = torch.utils.data.DataLoader(
    mnist_train,
    batch_size=BATCH_SIZE,
    shuffle=False
)
dl_test = torch.utils.data.DataLoader(
    mnist_test,
    batch_size=10000,
    shuffle=False
)
def test(model):
    for i, (features_, labels_) in enumerate(dl_test):
        features, labels = torch.reshape(features_, (10000, 28 * 28)), labels_
        pred = model.forward(features)
        return pred.argmax(dim=1).eq(labels).sum().item() / 10000 * 100
def train(model, epochs=3, height=1):
    stats = []
    for epoch in range(epochs):
        for i, (features_, labels_) in enumerate(dl_train):
            t0 = time.process_time()
            model.begin()
            features, labels =\
                torch.reshape(features_, (BATCH_SIZE, 28 * 28)), labels_
            pred = model.forward(features)
            loss = F.nll_loss(pred, labels)
            model.zero_grad()
            loss.backward(create_graph=True)
            model.adjust()
            tf = time.process_time()
            data = {
                time’: tf - t0,
                iter’: epoch * len(dl_train) + i,
                loss’: loss.item(),
                params’: {
                    k: v.item() for k, v in model.optimizer.parameters.items()
                    if ’.’ not in k
                }
            }
            stats.append(data)
    return stats
def run(opt, name=’out’, usr={}, epochs=3, height=1):
    torch.manual_seed(0x42)
    model = MNIST_FullyConnected(
        28 * 28, 128, 10, opt
    )
    print(’Running…’, str(model))
    model.initialize()
    log = train(model, epochs, height)
    acc = test(model)
    out = {’acc’: acc, log’: log, usr’: usr}
    with open(’log/%s.json % name, w’) as f:
        json.dump(out, f, indent=True)
    times = [x[’time’] for x in log]
    print(’Times (ms): ’, np.mean(times), ’+/-’, np.std(times))
    print(’Final accuracy:’, acc)
    return out
def sgd_experiments():
    run(SGD(0.01), sgd’, epochs=1)
    out = run(SGD(0.01, optimizer=SGD(0.01)), sgd+sgd’, epochs=1)
    alpha = out[’log’][-1][’params’][’alpha’]
    print(alpha)
    run(SGD(alpha), sgd-final’, epochs=1)
def adam_experiments():
    run(Adam(), adam’, epochs=1)
    print()
    mo = SGDPerParam(
        0.001, [’alpha’, beta1’, beta2’, log_eps’], optimizer=SGD(0.0001)
    )
    out = run(Adam(optimizer=mo), adam+sgd’, epochs=1)
    p = out[’log’][-1][’params’]
    alpha = p[’alpha’]
    beta1 = Adam.clamp(torch.tensor( p[’beta1’] )).item()
    beta2 = Adam.clamp(torch.tensor( p[’beta2’] )).item()
    log_eps = p[’log_eps’]
    print(alpha, beta1, beta2, log_eps)
    print(mo)
    run(
        Adam(alpha=p[’alpha’], beta1=beta1, beta2=beta2, log_eps=log_eps),
        adam+sgd-final’, epochs=1
    )
    print()
    out = run(Adam(optimizer=Adam()), adam2’, epochs=1)
    p = out[’log’][-1][’params’]
    alpha = p[’alpha’]
    beta1 = Adam.clamp(torch.tensor( p[’beta1’] )).item()
    beta2 = Adam.clamp(torch.tensor( p[’beta2’] )).item()
    log_eps = p[’log_eps’]
    print(alpha, beta1, beta2, log_eps)
    run(
        Adam(alpha=p[’alpha’], beta1=beta1, beta2=beta2, log_eps=log_eps),
        adam2-final’, epochs=1
    )
    print()
    mo = SGDPerParam(0.001, [’alpha’], optimizer=SGD(0.0001))
    out = run(AdamBaydin(optimizer=mo), adambaydin+sgd’, epochs=1)
    p = out[’log’][-1][’params’]
    alpha = p[’alpha’]
    print(alpha)
    print(mo)
    run(Adam(alpha=p[’alpha’]), adambaydin+sgd-final’, epochs=1)
    print()
    out = run(AdamBaydin(optimizer=Adam()), adambaydin2’, epochs=1)
    p = out[’log’][-1][’params’]
    alpha = p[’alpha’]
    print(alpha)
    run(Adam(alpha=p[’alpha’]), adambaydin2-final’, epochs=1)
def surface():
    run(SGD(10 ** -3, optimizer=SGD(10**-1)), tst’, epochs=1)
    for log_alpha in np.linspace(-3, 2, 10):
        run(
          SGD(10 ** log_alpha),
          [email protected]%+.2f % log_alpha,
          epochs=1
        )
def make_sgd_stack(height, top):
    if height == 0:
        return SGD(alpha=top)
    return SGD(alpha=top, optimizer=make_sgd_stack(height - 1, top))
def make_adam_stack(height, top=0.0000001):
    if height == 0:
        return Adam(alpha=top)
    return Adam(alpha=top, optimizer=make_adam_stack(height - 1))
def stack_test():
    for top in np.linspace(-7, 3, 20):
        for height in range(6):
            print(’height =’, height, top =’, top)
            opt = make_sgd_stack(height, 10**top)
            run(
                opt, metasgd3-%[email protected]%+.2f % (height, top),
                {’height’: height, top’: top}, epochs=1, height=height
            )
            gc.collect()
def perf_test():
    for h in range(51):
        print(’height:’, h)
        #opt = make_sgd_stack(h, 0.01)
        opt = make_adam_stack(h)
        run(opt, adamperf-%d % h, {’height’: h}, epochs=1)
        gc.collect()
if __name__ == __main__’:
    surface()
    sgd_experiments()
    adam_experiments()
    stack_test()
    perf_test()