Abstract
Self-attention mechanisms model long-range context by using pairwise attention between all input tokens. In doing so, they assume a fixed attention granularity defined by the individual tokens (e.g., text characters or image pixels), which may not be optimal for modeling complex dependencies at higher levels. In this paper, we propose ContextPool to address this problem by adapting the attention granularity for each token. Inspired by the success of ConvNets that are combined with pooling to capture long-range dependencies, we learn to pool neighboring features for each token before computing attention in a given attention layer. The pooling weights and support size are adaptively determined, allowing the pooled features to encode meaningful context with varying scale. We show that ContextPool makes attention models more expressive, achieving strong performance often with fewer layers and thus significantly reduced cost. Experiments validate that our ContextPool module, when plugged into transformer models, matches or surpasses state-of-the-art performance using less compute on several language and image benchmarks, outperforms recent works with learned context sizes or sparse attention patterns, and is also applicable to ConvNets for efficient feature learning.
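
As an illustration only, the idea of pooling neighboring features per token before attention can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation evaluated in the paper: the Gaussian window over token positions, the sigmoid and softplus parameterizations, and the names `context_pool`, `w_weight`, and `w_size` are all hypothetical choices made for this example, and the attention step omits learned query, key, and value projections.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def context_pool(x, w_weight, w_size):
    """Pool neighboring features for each token (illustrative sketch).

    x:        (n, d) token features.
    w_weight: (d,) hypothetical projection giving a per-token pooling weight.
    w_size:   (d,) hypothetical projection giving a per-token support size.
    """
    n = x.shape[0]
    # Per-token pooling weight in (0, 1) (assumed sigmoid parameterization).
    weights = 1.0 / (1.0 + np.exp(-(x @ w_weight)))
    # Per-token positive support size (assumed softplus parameterization).
    sizes = np.log1p(np.exp(x @ w_size)) + 1e-3
    # Squared distance between token positions i and j.
    pos = np.arange(n)
    dist2 = (pos[:, None] - pos[None, :]) ** 2
    # Soft window around token i whose width is token i's support size
    # (the Gaussian shape is an assumption of this sketch).
    mask = np.exp(-dist2 / (2.0 * sizes[:, None] ** 2))
    # Combine window and pooling weights, then normalize each row to sum to 1.
    coeff = mask * weights[None, :]
    coeff = coeff / coeff.sum(axis=1, keepdims=True)
    return coeff @ x

def self_attention(x):
    # Plain dot-product self-attention with identity Q/K/V projections.
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
pooled = context_pool(x, rng.normal(size=4), rng.normal(size=4))
out = self_attention(pooled)  # attention over pooled, context-aware features
```

In this sketch a token with a small support size attends at close to its original granularity, while a token with a large support size is replaced by a weighted average over a wide neighborhood, which is one way to read "adapting the attention granularity for each token".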