Training Transformers with Enforced Lipschitz Constants

Abstract

Neural networks are often highly sensitive to input and weight perturbations. This sensitivity has been linked to pathologies such as vulnerability to adversarial examples, divergent training, and overfitting. To combat these problems, past research has looked at building neural networks entirely from Lipschitz components. However, these techniques have not matured to the point where researchers have trained a modern architecture such as a transformer with a Lipschitz certificate enforced beyond initialization. To explore this gap, we begin by developing and benchmarking novel, computationally-efficient tools for maintaining norm-constrained weight matrices. Applying these tools, we are able to train transformer models with Lipschitz bounds enforced throughout training. We find that optimizer dynamics matter: switching from AdamW to Muon improves standard methods -- weight decay and spectral normalization -- allowing models to reach equal performance with a lower Lipschitz bound. Inspired by Muon's update having a fixed spectral norm, we co-design a weight constraint method that improves the Lipschitz vs. performance tradeoff on MLPs and 2M parameter transformers. Our 2-Lipschitz transformer on Shakespeare text reaches validation accuracy 60%. Scaling to 145M parameters, our 10-Lipschitz transformer reaches 21% accuracy on internet text. However, to match the NanoGPT baseline validation accuracy of 39.4%, our Lipschitz upper bound increases to 10^264. Nonetheless, our Lipschitz transformers train without stability measures such as layer norm, QK norm, and logit tanh softcapping.
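
For intuition about what an enforced Lipschitz bound involves, the following is a minimal sketch and not the method developed in the paper. It assumes a plain MLP with 1-Lipschitz activations (such as ReLU) and the L2 norm, in which case the product of the layers' spectral norms is an upper bound on the network's Lipschitz constant. The helper names `cap_spectral_norm` and `mlp_lipschitz_bound` are hypothetical, and the rescaling shown is ordinary spectral normalization applied as a cap, computed here with an exact singular-value routine rather than an efficient approximation.

```python
import numpy as np


def cap_spectral_norm(W, max_norm=1.0):
    """Rescale W so that its largest singular value is at most max_norm.

    Hypothetical helper: matrices already within the cap are returned unchanged.
    """
    sigma = np.linalg.norm(W, ord=2)  # largest singular value of a 2-D array
    if sigma <= max_norm:
        return W
    return W * (max_norm / sigma)


def mlp_lipschitz_bound(weights):
    """Upper bound on the L2 Lipschitz constant of an MLP.

    Assumes every activation is 1-Lipschitz, so the bound is the product
    of the spectral norms of the weight matrices.
    """
    return float(np.prod([np.linalg.norm(W, ord=2) for W in weights]))


# Usage: cap three random layers and compare the bounds before and after.
rng = np.random.default_rng(0)
layers = [3.0 * rng.standard_normal((8, 8)) for _ in range(3)]
capped = [cap_spectral_norm(W, max_norm=1.0) for W in layers]
print("bound before capping:", mlp_lipschitz_bound(layers))
print("bound after capping: ", mlp_lipschitz_bound(capped))
```

Because the bound is a product over layers, it can grow very quickly with depth even when each factor is only modestly above 1, which is one way to read the very large upper bounds quoted above.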
