Abstract
As text-to-image models grow increasingly powerful and complex, their burgeoning size presents a significant obstacle to widespread adoption, especially on resource-constrained devices. This paper presents a pioneering study on post-training pruning of Stable Diffusion 2, addressing the critical need for model compression in the text-to-image domain. Our study investigates pruning techniques for previously unexplored multi-modal generation models, and in particular examines the impact of pruning on the textual component and the image generation component separately. We conduct a comprehensive comparison of pruning the full model and pruning each component individually at various sparsity levels. Our results yield previously undocumented findings. For example, contrary to established trends in language model pruning, we discover that simple magnitude pruning outperforms more advanced techniques in the text-to-image context. Furthermore, our results show that Stable Diffusion 2 can be pruned to 38.5% sparsity with minimal quality loss, achieving a significant reduction in model size. We propose an optimal pruning configuration that prunes the text encoder to 47.5% sparsity and the diffusion generator to 35% sparsity. This configuration maintains image generation quality while substantially reducing computational requirements. In addition, our work raises intriguing questions about information encoding in text-to-image models: we observe that pruning beyond certain thresholds leads to sudden performance drops (unreadable images), suggesting that specific weights encode critical semantic information. This finding opens new avenues for future research in model compression, interoperability, and bias identification in text-to-image models. By providing crucial insights into the pruning behavior of text-to-image models, our study lays the groundwork for developing more efficient and accessible AI-driven image generation systems.
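For readers unfamiliar with the baseline, the following is a minimal sketch of unstructured magnitude pruning on a flat list of weights, written in plain Python. It is an illustration under stated assumptions, not the implementation used in this study: the function names `magnitude_prune` and `measured_sparsity` and the toy weight values are hypothetical, and a real pipeline would operate on the tensors of the text encoder and the diffusion generator rather than on a Python list.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries set to 0.

    `sparsity` is the fraction of entries to zero out (0.0 to 1.0).
    Hypothetical helper for illustration only.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    n_prune = round(sparsity * len(weights))
    # Rank positions by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def measured_sparsity(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Toy example (values are made up): prune 3 of 8 weights.
toy_weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2, -0.4, 0.6]
toy_pruned = magnitude_prune(toy_weights, 0.375)
```

Under this scheme the only criterion for removing a weight is its absolute value; no calibration data or gradient information is consulted, which is what distinguishes it from the more advanced techniques referred to above.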