NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks

Abstract

The performance of neural networks improves when more parameters are used. However, model sizes are constrained by the available on-device memory during training and inference. Although techniques like quantization can alleviate the constraint, they suffer from performance degradation. In this work, we introduce NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks. With NeuZip, we are able to achieve memory-efficient training and inference without sacrificing performance. Notably, we significantly reduce the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB, while keeping the training dynamics fully unchanged. In inference, our method can reduce memory usage by more than half while maintaining near-lossless performance. Our code is publicly available.
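
To give a feel for what "entropy of floating-point numbers" can mean in practice, the following minimal sketch measures the empirical Shannon entropy of the 8-bit exponent field of float32 values. It is an illustration only, not NeuZip's algorithm: the focus on the exponent field, the float32 format, the synthetic Gaussian weights with standard deviation 0.02, and all function names are assumptions made here, not details taken from the abstract.

```python
import math
import random
import struct
from collections import Counter


def float32_exponent(x: float) -> int:
    """Return the 8-bit biased exponent field of x stored as IEEE 754 float32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return (bits >> 23) & 0xFF


def shannon_entropy_bits(symbols) -> float:
    """Empirical Shannon entropy, in bits per symbol, of an iterable of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def exponent_entropy(weights) -> float:
    """Entropy of the float32 exponent field across a collection of weights."""
    return shannon_entropy_bits(float32_exponent(w) for w in weights)


# Hypothetical stand-in for trained weights: zero-mean Gaussian samples.
rng = random.Random(0)
synthetic_weights = [rng.gauss(0.0, 0.02) for _ in range(100_000)]
entropy = exponent_entropy(synthetic_weights)
print(f"exponent entropy: {entropy:.2f} bits out of 8 stored bits")
```

If the measured entropy is well below 8 bits, an entropy coder could in principle store the exponent field in fewer bits without losing information, which is the general intuition behind lossless compression of this kind.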
