4-bit Shampoo for Memory-Efficient Network Training

Abstract

Second-order optimizers, maintaining a matrix termed a preconditioner, are superior to first-order optimizers in both theory and practice. The states forming the preconditioner and its inverse root restrict the maximum size of models trained by second-order optimizers. To address this, compressing 32-bit optimizer states to lower bitwidths has shown promise in reducing memory usage. However, current approaches only pertain to first-order optimizers. In this paper, we propose the first 4-bit second-order optimizers, exemplified by 4-bit Shampoo, maintaining performance similar to that of 32-bit ones. We show that quantizing the eigenvector matrix of the preconditioner in 4-bit Shampoo is remarkably better than quantizing the preconditioner itself both theoretically and experimentally. By rectifying the orthogonality of the quantized eigenvector matrix, we enhance the approximation of the preconditioner's eigenvector matrix, which also benefits the computation of its inverse 4-th root. Besides, we find that linear square quantization slightly outperforms dynamic tree quantization when quantizing second-order optimizer states. Evaluation on various networks for image classification and natural language modeling demonstrates that our 4-bit Shampoo achieves comparable performance to its 32-bit counterpart while being more memory-efficient.
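
The pipeline summarized above (quantize the eigenvector matrix, rectify its orthogonality, then form the inverse 4-th root) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration rather than a detail taken from the paper: the plain uniform 4-bit quantizer with one scale per matrix (not the linear square or dynamic tree quantizers named in the abstract), the single Björck-style iteration used as the rectification step, the toy matrix sizes, and all function names.

```python
import numpy as np

def quantize_4bit(x):
    """Uniform 4-bit quantize/dequantize with one scale per matrix.

    Assumption: a deliberately simple stand-in quantizer, not the
    quantization scheme used in the paper.
    """
    scale = np.abs(x).max()
    levels = 15  # 16 codes span 15 intervals across [-scale, scale]
    codes = np.round((x / scale + 1.0) / 2.0 * levels)  # integers in 0..15
    return (codes / levels * 2.0 - 1.0) * scale

def bjorck_step(v):
    """One Björck-style iteration pulling v toward the nearest orthogonal matrix.

    Assumption: chosen here as one possible way to rectify orthogonality.
    """
    return 1.5 * v - 0.5 * v @ (v.T @ v)

def inverse_4th_root(eigvals, eigvecs, eps=1e-6):
    """Form V diag((lambda + eps)^(-1/4)) V^T from an eigendecomposition."""
    return (eigvecs * (eigvals + eps) ** -0.25) @ eigvecs.T

def orth_error(v):
    """Frobenius norm of V^T V - I; zero for an exactly orthogonal V."""
    return np.linalg.norm(v.T @ v - np.eye(v.shape[1]))

# Toy symmetric positive definite preconditioner built from random "gradients".
rng = np.random.default_rng(0)
g = rng.standard_normal((16, 64))
precond = g @ g.T / 64 + 1e-3 * np.eye(16)

eigvals, eigvecs = np.linalg.eigh(precond)

v_q = quantize_4bit(eigvecs)   # eigenvectors after 4-bit storage
v_fix = bjorck_step(v_q)       # orthogonality rectified

exact = inverse_4th_root(eigvals, eigvecs)
approx = inverse_4th_root(eigvals, v_fix)

print("orthogonality error, quantized:", orth_error(v_q))
print("orthogonality error, rectified:", orth_error(v_fix))
print("inverse 4-th root error:", np.linalg.norm(approx - exact))
```

In this sketch the eigenvalues are kept in full precision and only the eigenvector matrix passes through the 4-bit quantizer; that split is also an assumption of the example.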
