Self-attention, as the key block of transformers, is a powerful mechanism forextracting features from the inputs. In essence, what self-attention does toinfer the pairwise relations between the elements of the inputs, and modify theinputs by propagating information between input pairs. As a result, it mapsinputs to N outputs and casts a quadratic $O(N^2)$ memory and time complexity.We propose centroid attention, a generalization of self-attention that maps Ninputs to M outputs $(M\leq N)$, such that the key information in the inputsare summarized in the smaller number of outputs (called centroids). We designcentroid attention by amortizing the gradient descent update rule of aclustering objective function on the inputs, which reveals an underlyingconnection between attention and clustering. By compressing the inputs to thecentroids, we extract the key information useful for prediction and also reducethe computation of the attention module and the subsequent layers. We apply ourmethod to various applications, including abstractive text summarization, 3Dvision, and image processing. Empirical results demonstrate the effectivenessof our method over the standard transformers.