Working with any gradient-based machine learning algorithm involves thetedious task of tuning the optimizer's hyperparameters, such as the learningrate. There exist many techniques for automated hyperparameter optimization,but they typically introduce even more hyperparameters to control thehyperparameter optimization process. We propose to instead learn thehyperparameters themselves by gradient descent, and furthermore to learn thehyper-hyperparameters by gradient descent as well, and so on ad infinitum. Asthese towers of gradient-based optimizers grow, they become significantly lesssensitive to the choice of top-level hyperparameters, hence decreasing theburden on the user to search for optimal values.
Quick Read (beta)
Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer’s hyperparameters, such as the learning rate. There exist many techniques for automated hyperparameter optimization, but they typically introduce even more hyperparameters to control the hyperparameter optimization process. We propose to instead learn the hyperparameters themselves by gradient descent, and furthermore to learn the hyper-hyperparameters by gradient descent as well, and so on ad infinitum. As these towers of gradient-based optimizers grow, they become significantly less sensitive to the choice of top-level hyperparameters, hence decreasing the burden on the user to search for optimal values.
oddsidemargin has been altered.
marginparsep has been altered.
topmargin has been altered.
marginparwidth has been altered.
marginparpush has been altered.
paperheight has been altered.
The page layout violates the ICML style. Please do not change the page layout, or include packages like geometry, savetrees, or fullpage, which change it for you. We’re not able to reliably undo arbitrary changes to the style. Please remove the offending package(s), or layout-changing commands and try again.
Gradient Descent: The Ultimate Optimizer
Kartik Chandra 0 Erik Meijer 0
Samantha Andow 0 Emilio Arroyo-Fang 0 Irene Dea 0 Johann George 0 Melissa Grueter 0 Basil Hosmer 0 Steffi Stumpos 0 Alanna Tempest 0 Shannon Yang 0
Preprint under review.
Copyright 2019 by the author(s).\@xsect
Usually we think of using gradient descent to optimize weights and other parameters of neural networks. Differentiable programming languages promise to make arbitrary code differentiable, allowing us to use gradient descent to optimize any program parameter that would otherwise be hard-coded by a human. Hence, there is no reason we should not be able to use gradient descent to optimize quantities other than the weights of a neural network, for instance hyperparameters like the gradient descent step size. But we don’t need to stop there, and we can just as well learn the hyper-hyperparameters used to optimize those hyperparameters, along with other constants occurring in gradient descent optimizers (DBLP:journals/corr/Ruder16).
In this paper we show that differentiable programming makes it practical to tune arbitrarily tall recursive towers of optimizers, where each optimizer adjusts the hyperparameters of its descendant:
Like DBLP:journals/corr/BaydinCRSW17, we independently rediscovered the idea of Almeida1999ParameterAI to implement efficient on-line hyperparameter optimizers by gradient descent. However, we generalize the approach of DBLP:journals/corr/BaydinCRSW17 in several dimensions.
In Section id1 we show how to craft the automatic differentiation (AD) computation graph such that the calculations to derive the hyperparameter update formula performed manually by DBLP:journals/corr/BaydinCRSW17 come “for free” as a result of reverse-mode automatic differentiation, just like the update rule for the weights does. This eliminates the need for certain tedious manual computations.
In Section id1 we utilize this newfound power to differentiate with respect to hyperparameters beyond just the learning rate, such as Adam’s , and , and in Section id1 show empirically that learning these extra hyperparameters improves results.
Furthermore, in Section id1, we realize the vision of recursively stacking multiple levels of hyperparameter optimizers that was only hypothesized by DBLP:journals/corr/BaydinCRSW17 Hyperparameter optimizers can themselves be optimized, as can their optimizers, and so on ad infinitum. We demonstrate empirically in Section id1 that such towers of optimizers are scalable to many recursive levels.
Section id1 shows that taller stacks of hyperoptimizers are indeed significantly less sensitive to the choice of top-level hyperparameters. This reduces the burden on humans responsible for tuning the hyperparameters — rather than “seeking needles in a haystack,” we can instead simply “shoot fish in a barrel.”
What does it mean to optimize an optimizer? Consider Figure 1, which depicts a “hyperoptimization surface” for using stochastic gradient descent (SGD) to optimize some loss function . Each thin trace is a loss curve of SGD with the given step size hyperparameter . These loss curves form cross-sections of the surface along the axis, parallel to the batch-number/loss plane.
Notice that with SGD performs poorly because the parameters update too slowly to make meaningful progress by the end of the training period. Similarly, for SGD also performs poorly because the parameter updates are too noisy to converge to the optimum. The optimal hyperparameter is therefore somewhere between and . If we were training this model, we would have to manually discover this range by experimentation.
Instead, imagine if we could utilize a variant of SGD to climb down this surface no matter where we started, as demonstrated by the thick orange trace. Unlike the thin traces of vanilla SGD, the thick orange trace is not confined to a single plane — though it begins at the highly sub-optimal , it gradually “learns” to increase , and attains a final loss function on par with the optimal hyperparameter for vanilla SGD. In Section id1 we will describe how to achieve this by adjusting at each step of SGD.
Note that our approach is not limited to tuning just step sizes. The Adam optimizer, for example, already intelligently adjusts the step size for each parameter based on past progress (Adam). However, Adam still has its own fixed hyperparameters: the learning rate , the two moment coefficients , and the factor used to avoid division by zero. For instance, the recommended default for is often quoted as , but as the TensorFlow documentation remarks, sometimes it is better to use or instead. As we show in Section id1, we can tune these additional hyperparameters automatically.
Existing work (Maclaurin:2015:GHO:3045118.3045343; Pedregosa:2016:HOA:3045390.3045469; pmlr-v70-franceschi17a), attempts to learn a single optimal hyperparameter for the entire training history — by gradient descent on the dashed black “U” in Figure 1. This is inefficient because it requires memory to store the entire unrolled run. Our work, along with DBLP:journals/corr/BaydinCRSW17 and Almeida1999ParameterAI, uses a stochastic variant of the above instead: perform incremental updates to the hyperparameter in parallel with the learning. Since each incremental update only depends on its immediate history, we can “forget” all but a constant amount of information of the unrolled run that non-stochastic approaches have to “remember” and fully differentiate through.
Consider some stochastic loss function that we want to minimize using gradient descent, and let be the weights at the beginning of step . Let us first recall the standard weight update rule at step for SGD, using some (fixed) step size :
We would like to update as well at each step, so we will index it with the step number also: let be the step size at the beginning of step . At each step, we will first update the step size to using some update rule yet to be derived, and then use the updated step size to update the weights from to .
What should the adjustment for be? By analogy to , we want to adjust in the direction of the gradient of the loss function with respect to , scaled by some hyper-step size . In other words, the adjustment should be . Section id1 addresses the practical matter of the selection of this hyper-hyperparameter — for now, we will take as a given fixed constant. Our modified update rule is therefore:
All that remains is to compute in equation (1).
In the next section, we review how DBLP:journals/corr/BaydinCRSW17 compute this derivative by hand, obtaining an elegant and efficiently-computable expression. In the section that follows, we show how we can compute the partial derivative for the step size update completely automatically, exactly like the partial derivative for the weights. This makes it possible to generalize our approach in many different ways.
One option to compute , explored by DBLP:journals/corr/BaydinCRSW17, is to proceed by direct manual computation of the partial derivative. Applying the chain rule to the derivative in question, we can compute
where (4) is obtained by substituting the update rule in (2) for and (5) is obtained by observing that does not depend on , and can therefore be treated as a constant. In this particular case, the resulting expression is simple and elegant: the dot product of the preceding two gradients with respect to the weights, which as we see from equation (2) would have already been computed in order to update the weights themselves. We only need to remember the previous derivate, so this is very memory efficient and time efficient.
The direct-computation strategy works well when the update rule is easy to differentiate by hand with respect to the hyperparameter — in SGD as above, it is simply a multiplication by a constant, whose derivative is trivial. However, this is not always the case. Consider, for example, the update rule for the Adam optimizer, as in Algorithm 1 of Adam, which has a much more complicated dependence on the hyperparameters and . Differentiating the update rule by hand, we obtain the following results, caveat emptor:
We see how again the derivatives of the loss function with respect to the hyperparameters and are defined in terms of the previous value of the the derivative for the actual parameters , but embedded within a much more complex expression than before. Clearly, this manual approach to compute the hyperparameter update rules does not scale. However, with a little bit of care we can actually compute these derivatives automatically by backwards AD, just like we do for regular parameters.
In order to compute automatically, let us first briefly review the operational mechanics of reverse-mode automatic differentiation. Frameworks that provide reverse-mode AD (Griewank) to compute do so by building up a backwards computation graph as the function is computed forwardly. For example, when a user computes the loss function , the framework internally stores a DAG whose leaves are the weights , whose internal nodes are intermediate computations (for a DNN the outputs of each successive layer), and whose root is the concrete loss function such as LogSoftmax. The framework can then backpropagate from the backwards computation graph created for this root node, depositing gradients in each node as it descends, until the weights at the leaf nodes have accumulated the gradient .
Once the gradient is computed by the backwards pass, we can then continue to update the weights as shown above, and repeat the cycle for the next training batch. However, an important consideration is for the weights to be “detached” from the computation graph before each iteration of this algorithm — that is, for the weights to be forcibly converted to leaves of the graph by removing any inbound edges. The effect of the “detach” operation is depicted in Figure 2. If this step is skipped, the next backpropagation iteration will continue beyond the current weights into the past. This is problematic in a couple of ways, depending on how the weight update is implemented. If the weight update is implemented as an in-place operation, then this will yield incorrect results as more and more gradients get accumulated onto the same node of the computation graph. If the weight update is implemented by creating a fresh node, then over time the computation graph will grow taller linearly in the number of steps taken; because backpropagation is linear in the size of the graph, the overall training would become quadratic-time and intractable.
Let’s peek inside the implementation of SGD in PyTorch (PyTorch) as of commit bb41e6 to see how this cutting-off is implemented in the actual source code:
Here, p represents the parameter being optimized (i.e. ) and lr is the learning rate (i.e. ). A few more PyTorch-specific clarifications: p.grad retrieves the gradient of the loss function with respect to p. The call to .add_ updates p in-place with the product of the arguments, that is, with . Most importantly, the highlighted calls to .data implement the “detachment” by referring to the datastore of that variable directly, ignoring the associated computation graph information. Note that in vanilla PyTorch the step size is not learned, so no call to .data is required for the learning rate because it is internally stored as a raw Python float rather than a differentiable variable tracked by PyTorch.
For the sake of consistency let us rewrite this function, renaming variables to match the above discussion and promoting alpha to a differentiable variable in the form of a rank-0 tensor. In order to keep the computation graph clear, we will also update weights by creating fresh nodes rather than to change them in-place.
The highlighted calls to .detach() correspond to detaching the weights and their gradients. Now, in order to have backpropagation deposit the gradient with respect to as well as , we can simply refrain from detaching from the graph, detaching instead its parents. This is depicted in Figure 2.
Notice in particular that because we want to compute the edge from to needs to remain intact. To implement this, instead of calling .detach() on alpha directly, we instead call .detach() on its parents when adjusting it using equation (1). This change yields the following fully-automated hyperoptimization algorithm11 1 The example code in this section elides some small PyTorch-related details. Appendix id1 contains the full PyTorch source code for all examples and experiments in this paper.: ⬇ def HyperSGD.adjust(w): # update alpha using Equation (1) d_alpha = self.alpha.grad.detach() self.alpha = self.alpha.detach() - kappa.detach() * d_alpha # update w using Equation (2) d_w = w.grad.detach() w = w.detach() - self.alpha.detach() * d_w Notice that because we are only extending the computation graph by a little extra amount (corresponding to evaluating the optimizer), the backwards AD pass should not be significantly more computationally expensive. Section id1 presents an empirical evaluation of the computational cost to hyperoptimization.
As suggested in previous work (Maclaurin:2015:GHO:3045118.3045343), it should be possible to apply gradient-based methods for tuning hyperparameters of common variations on SGD such as AdaGrad (AdaGrad), AdaDelta (AdaDelta), or Adam (Adam). The above implementation of HyperSGD generalizes quite easily to these optimizers. In this section we demonstrate the HyperAdam optimizer, which mostly follows by analogy to HyperSGD. Unlike previous work, which could only optimize Adam’s learning rate (DBLP:journals/corr/BaydinCRSW17), we are able to optimize all four hyperparameters of Adam automatically. Our evaluation in Section id1 demonstrates that this indeed useful to do.
There are, however, two important subtleties to be aware of.
First, because the hyperparameters and must be strictly in the domain , we clamp the “raw” values to this domain using a scaled sigmoid. Without this step, we might accidentally adjust these values outside their domains, which ultimately leads to arithmetic exceptions.
Second, the Adam optimizer involves the term , which is continuous but not differentiable at . Because Adam normally initializes , backpropagation would fail on the very first step due to a division by zero error. We fix this problem by initializing to rather than 0.
These two subtleties reveal a limitation of our automatic approach to hyperparameter optimization: while the domain restrictions are evident in the explicit formulas presented above (notice, for example, the in the denominator of the expression for derived above, which would immediately to signal a user the potential for division-by-zero), they are more difficult to predict and debug if the derivatives are taken automatically. The hyperparameter update rule does not “know” the domains of the hyperparameters, and so it might step too far and lead to a mysterious crash or nan issue. In practice however, this has not been a showstopper.
Implementing these fixes, and remembering to .detach() the Adam intermediate state (i.e. and ) in the right place to prevent “leaks” in the backwards AD, we obtain the following implementation:
At this point it is natural to ask whether the hyperoptimizer can itself be optimized; that is, whether the human-selected hyper-hyperparameter to update the hyperparameters (e.g. ) can be adjusted by a hyper-hyperoptimizer. The possibility of doing so recursively ad infinitum to obtain an optimization algorithm that is highly robust to the top-level human-chosen hyperparameter was hypothesized in Section 5.2 of DBLP:journals/corr/BaydinCRSW17. Computing the gradients of these higher-order hyperparameters by hand is impossible without knowing the exact sequence of stacked optimizers ahead of time, and as we have shown above, will be extremely tedious and error prone.
However, the ability to compute these gradients automatically by backwards AD makes it possible to realize this vision. To do so, let us revisit our previous implementation of HyperSGD. Notice that there is an opportunity for recursion lurking here: the adjustment to alpha can be factored out with a call to SGD.adjust, where SGD’s hyperparameter is kappa.
Because SGD is already careful to properly detach its parameter (typically , but in this case ), this implementation is functionally identical to the one above. Indeed, any optimizer that observes this protocol would suffice, so let us abstract out the optimizer as a parameter to HyperSGD:
After this refactoring, finally, we can recursively feed HyperSGD itself as the optimizer, obtaining a level-2 hyperoptimizer HyperSGD(0.01, HyperSGD(0.01, SGD(0.01))). Similarly, we can imagine taller towers, or towers that mix and match multiple different kinds of optimizers, such as Adam-optimized-by-SGD-optimized-by-Adam.
A natural application of this idea is to automatically learn hyperparameters on a per-parameter basis. For example, when hyperoptimizing Adam with SGD as in Section id1, it is extremely beneficial to maintain a separate hyper-step size (i.e. a separate ) for each of the four hyperparameters, since they typically span many orders of magnitude. Instead of specifying each as a separate top-level hyperparameter, however, we can instead apply a second level of SGD that lets the system automatically learn optimal hyper-step sizes for each Adam hyperparameter separately.
A logical concern is whether this process actually exacerbates the hyperparameter optimization problem by introducing even more hyperparameters. DBLP:journals/corr/BaydinCRSW17 predicted that as the towers of hyperoptimizers grow taller, the resulting algorithms would be less sensitive to the human-chosen hyperparameters, and therefore the overall burden on the user will be reduced. This indeed seems to be the case; Section id1 presents an empirical evaluation of this hypothesis.
|SGD(0.01) / SGD(0.01)||88.35%||16ms|
|SGD(0.01) / Adam(…)||86.80%||23ms|
|Adam(…) — baseline||91.09%||38ms|
|Adam(…) / SGD() / SGD()||92.74%||43ms|
|Adam( 0.0291, 0.8995, 0.999, -8)||93.74%||40ms|
|Adam(…) / SGD() / SGD()||92.64%||37ms|
|Adam( 0.0284, *)||93.42%||41ms|
|Adam(…) / Adam(…)||94.35%||41ms|
|Adam( 0.013, 0.892, 0.998, -8)||94.75%||36ms|
|Adam(…) / Adam(…)||94.01%||31ms|
|Adam( 0.013, …)||94.39%||32ms|
In this section we evaluate the hyperoptimizers made possible by our system, exploring in particular the benefits of being able to optimize hyperparameters beyond just step size and of higher-order optimization, as well as whether or not there is a significant computational cost to automatically computing derivatives with respect to hyperparameters.
Like authors of previous work Maclaurin:2015:GHO:3045118.3045343; DBLP:journals/corr/BaydinCRSW17, we conducted all of our experiments were conducted on the MNIST dataset (MNIST) using a neural network with one fully-connected hidden layer of size 128, activations, and a batch size of 300 run for a single epoch. We implemented the system in PyTorch and ran experiments on a 2.4 GHz Intel CPU with 32GB of memory. The full source code for each of these experiments is presented in Appendix id1.
We denote species of hyperoptimizers by their sequence of constituent optimizers with their initial hyperparameters. The leftmost item adjusts the parameters of the model whereas the rightmost item has fixed hyperparameters. For example, the term “SGD(0) / Adam(0.001, 0.9, 0.999, -8)” indicates that the weights of the neural network were adjusted by stochastic gradient descent with a step size that, while initially 0, was adjusted by a regular Adam optimizer with hyperparameters . Adam denotes an Adam optimizer where only is optimized as in DBLP:journals/corr/BaydinCRSW17, and the abbreviations Adam(…) and Adam(…) denote the respective optimizers with the standard hyperparameters , which are recommended by Adam and are used by default almost universally across software packages.
Here we seek to answer two questions: (1) whether an SGD hyperoptimizer performs better than an elementary SGD optimizer22 2 Following previous work (Maclaurin:2015:GHO:3045118.3045343; DBLP:journals/corr/BaydinCRSW17), we refer to standard, vanilla “non-hyperoptimized” optimizers as “elementary optimizers.”, and (2) whether or not the learned step size outperforms the initial step size. We test the latter property by running a fresh elementary SGD optimizer with the final learned step size of the hyperoptimizer.
Table 1 summarizes the results of our experiments, run with an initial step size of 0.01 (see Section id1 for a discussion of the sensitivity of these results to this initial step size). We find that hyperoptimized SGD outperforms the baseline by a significant margin (nearly 10%). This holds even if we use an Adam optimizer to adjust the step size of the SGD optimizer.
Furthermore, when we re-ran the elementary optimizers with the new learned hyperparameters, we found that they typically performed incrementally better than the hyperparameter itself. This is what Luketina:2016:SGT:3045390.3045701 refer to as the “hysteresis” effect of hyperparameter optimization: we cannot reap the benefits of the optimized hyperparameters while they are themselves still in the early stages of being optimized — thus, hyperoptimizers should “lag” slightly behind elementary optimizers that start off with the final optimized hyperparameters.
In Section id1, we described how to apply our system to optimizing the Adam optimizer, which maintains first- and second-order momentum information for each parameter it optimizes. Altogether, there are four hyperparameters: a learning rate (), coefficients for the first- and second-order momenta (), and an epsilon value (). We tune all four simultaneously, first using SGD and then by using Adam itself on the top-level. Our Adam / SGD experiments utilize the higher-order design proposed at the end of Section id1 to learn a separate hyper-step size for each hyperparameter of Adam; this is not needed for Adam / Adam because Adam by design already maintains separate information for each parameter it optimizes.
We seek to answer three questions: (1) whether hyperoptimized Adam optimizers perform better than elementary Adam optimizers, (2) whether the learned hyperparameters outperform the baseline, and (3) whether there is a benefit to optimizing all four hyperparameters, as opposed to only optimizing the learning rate as DBLP:journals/corr/BaydinCRSW17 do.
Table 2 summarizes the results of our experiments. We find that indeed the hyperoptimized Adam optimizer outperforms the elementary Adam optimizer on its “default” settings. As with SGD in Section id1, the learned hyperparameters perform incrementally better than the hyperoptimizer due to the hysteresis effect.
Inspecting the learned hyperparameters, we find that the algorithm significantly raises the learning rate and slightly lowers , but does not significantly affect either or . Nevertheless, learning does provide a noticeable benefit: our hyperoptimized Adam outperforms hyperoptimized Adam, which can only learn . Both hyperoptimizers learn similar optimized values for , but Adam cannot also adapt , and therefore does not perform as well.
In Section id1 we developed an interface for building arbitrarily tall towers of optimizers. Recall that DBLP:journals/corr/BaydinCRSW17 hypothesized that taller towers would yield hyperoptimizers that were more robust to the top-level human-chosen hyperparameters than elementary optimizers are.
To validate this behavior of higher-order hyperoptimizers, we ran the above benchmark with towers of SGD-based hyperoptimizers of increasing heights, where each layer of SGD started with the same initial step size . Figure 3 shows the results of this experiment. It is indeed the case that the taller the hyperoptimizer stack, the less sensitive the results become for the top-level hyper-parameters; roughly one order of magnitude per step until after 5-6 levels the graphs converge.
Notice also that the sensitivity only decreases for smaller initial step sizes; all hyperoptimizers performed poorly beyond . We hypothesize that it is difficult to recover from a too-high initial step size because dramatic changes in parameters at each step make the stochastic loss function too noisy. In comparison, if the hyperoptimizer’s initial step size is too low, then the weights do not change very dramatically at each step, and as a result a “signal” to increase the step size can be extracted from a series of stochastic gradient descent steps.
When we stack a new hyperparameter optimizer, we are effectively adding a new layer to the computation graph. This corresponds to extending each step of Figure 2 further to the left by yet another node. Thus, with a hyperoptimizer stack of height we would obtain a computation graph of size at each step, which means backpropagation takes time . We should therefore expect training with a stack of hyperoptimizers to take time .
To test this hypothesis, we extended the benchmark from Section id1 to much taller stacks, up to height 50. Note that these experiments are meant to stress-test the system with significantly taller stacks than would typically be necessary (recall from Section id1 that the stacks of hyperoptimizers need not be taller than 3-4 to be highly effective).
As shown in Figure 4, higher-order hyperoptimization is indeed asymptotically linear-time in the height of the optimizer stack. Note how the slope of this linear relationship is quite small compared to the fixed computational cost of backpropagating through the loss function. This makes sense: the additional work at each level is only the computational cost of backpropagating through the new top-level optimizer, which is typically much simpler than the machine learning model itself. Indeed, the difference in slopes between higher-order SGD in Figure 4 and higher-order Adam in Figure 4 is simply because Adam is a more complex optimizer, requiring more computation to differentiate through.
In summary, we find that in practice higher-order hyperoptimization is an extremely lightweight addition to any machine learning model with great benefits.
Hyperparameter optimization has a long history, and we refer readers interested in the full story to a recent survey by Feurer2019.
Most existing work on gradient-based hyperparameter optimization (BengioHyperopt; domke; Maclaurin:2015:GHO:3045118.3045343; Pedregosa:2016:HOA:3045390.3045469; pmlr-v70-franceschi17a) has focused on computing hyperparameter gradients after several iterations of training, which is computationally expensive because of the need to backpropagate through much more computation. DBLP:journals/corr/BaydinCRSW17, building on a technique first published by Almeida1999ParameterAI, propose instead updating hyperparameters at each step. Luketina:2016:SGT:3045390.3045701 apply a similar technique to regularization hyperparameters, though they explicitly note that their proposed method could work in principle for any continuous hyperparameter.
As discussed above, we expand upon this latter line of work in three directions: (1) by optimizing hyperparameters beyond just the learning rate; (2) by fully automating this process, rather than requiring manual derivative computations; and (3) by realizing the vision of recursively constructing higher-order hyperoptimizers and evaluating the resulting algorithms.
Like DBLP:journals/corr/BaydinCRSW17, we found that our hyperparameters converge extremely quickly. Further investigation is required to understand the dynamics of the higher-order hyperparameters. If there is indeed a compelling theoretical reason for this rapid convergence, it would suggest a form of higher-order “early stopping” where the hyperoptimizer monitors its hyperparameters’ convergence, and at some point decides to freeze its hyperparameters for the remainder of training. Besides the obvious performance improvement, this may allow the system to leverage the existing implicit regularization behavior exhibited by “vanilla” SGD.
While existing work on gradient-based hyperparameter optimization has primarily been evaluated in small-scale settings such as MNIST, automated hyperparameter tuning is particularly important in large-scale settings where training is computationally expensive, limiting the amount of manual hyperparameter tuning that can be done. Nonetheless, the choice of hyperparameters is still crucial: for example, a recent study improved significantly upon the state of the art in an NLP task simply by (manually) adjusting hyperparameters; indeed, they found that the performance was highly sensitive to Adam’s and hyperparameters (RoBERTa). A natural next step, therefore, is investigating the effectiveness of higher-order hyperoptimization in automatically reproducing such results.
The hyper-regularizer of Luketina:2016:SGT:3045390.3045701 could be combined with the recursive “higher-order” approach described in this paper in order to derive highly robust regularizers. We note that there is a clear connection between hyper-regularizers and hyperpriors in Bayesian inference; we leave further study of this connection to future work.
In this paper, we presented a technique to enhance optimizers such as SGD and Adam by allowing them to tune their own hyperparameters by gradient descent. Unlike existing work, our proposed hyperoptimizers learn hyperparameters beyond just learning rates, require no manual differentiation by the user, and can be stacked recursively to many levels. We described in detail how to implement hyperoptimizers in a reverse-mode AD system. Finally, we demonstrated empirically three benefits of hyperoptimizers: that they outperform elementary optimizers, that they are less sensitive to human-chosen hyperparameters than elementary optimizers, and that they are highly scalable.
Below is the full source code for the implementation described in Section id1 and all the experiments reported in Section id1.