Abstract
In recent years, Dynamic Sparse Training (DST) has emerged as an alternative to post-training pruning for generating efficient models. In principle, DST allows for a more memory-efficient training process, as it maintains sparsity throughout the entire training run. However, current DST implementations fail to capitalize on this in practice. Because sparse matrix multiplication is much less efficient than dense matrix multiplication on GPUs, most implementations simulate sparsity by masking weights. In this paper, we leverage recent advances in semi-structured sparse training to apply DST in the domain of classification with large output spaces, where memory efficiency is paramount. With a label space of possibly millions of candidates, the classification layer alone will consume several gigabytes of memory. Switching from a dense layer to a fixed fan-in sparse layer updated with sparse evolutionary training (SET), however, severely hampers training convergence, especially at the largest label spaces. We find that poor gradient flow from the sparse classifier to the dense text encoder makes it difficult to learn good input representations. By employing an intermediate layer or adding an auxiliary training objective, we recover most of the generalisation performance of the dense model. Overall, we demonstrate the applicability and practical benefits of DST in a challenging domain (characterized by a highly skewed label distribution that differs substantially from typical DST benchmark datasets), enabling end-to-end training with millions of labels on commodity hardware.