Abstract
The softmax (also called softargmax) function is widely used in machinelearning models to normalize real-valued scores into a probabilitydistribution. To avoid floating-point overflow, the softmax function isconventionally implemented in three passes: the first pass to compute thenormalization constant, and two other passes to compute outputs from normalizedinputs. We analyze two variants of the Three-Pass algorithm and demonstratethat in a well-optimized implementation on HPC-class processors performance ofall three passes is limited by memory bandwidth. We then present a novelalgorithm for softmax computation in just two passes. The proposed Two-Passalgorithm avoids both numerical overflow and the extra normalization pass byemploying an exotic representation for intermediate values, where each value isrepresented as a pair of floating-point numbers: one representing the"mantissa" and another representing the "exponent". Performance evaluationdemonstrates that on out-of-cache inputs on an Intel Skylake-X processor thenew Two-Pass algorithm outperforms the traditional Three-Pass algorithm by upto 28% in AVX512 implementation, and by up to 18% in AVX2 implementation. Theproposed Two-Pass algorithm also outperforms the traditional Three-Passalgorithm on Intel Broadwell and AMD Zen 2 processors. To fosterreproducibility, we released an open-source implementation of the new Two-PassSoftmax algorithm and other experiments in this paper as a part of XNNPACKlibrary at GitHub.com/google/XNNPACK.