Transformer-based models, such as BERT, have been among the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having $O(1)$ global tokens (such as CLS) that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences up to 8x longer than was previously possible on similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
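
To make the quadratic-versus-linear claim concrete, the following is a minimal sketch that counts how many (query, key) pairs a sparse attention pattern keeps as the sequence length grows. It is an illustration under stated assumptions, not the paper's implementation: the function names, the choice of the first `num_global` positions as global tokens, and the local window of width `window` are all hypothetical. The abstract itself describes only the $O(1)$ global tokens.

```python
def sparse_attention_mask(n, num_global=2, window=1):
    """Return an n x n boolean mask; mask[i][j] is True if query i may attend to key j.

    Assumptions (for illustration only):
      - the first `num_global` positions are global tokens that attend to,
        and are attended by, every position;
      - every other position also attends to neighbours within `window` steps.
    """
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            is_global = i < num_global or j < num_global
            is_local = abs(i - j) <= window
            mask[i][j] = is_global or is_local
    return mask


def count_attended_pairs(n, num_global=2, window=1):
    """Count the (query, key) pairs kept by the sparse pattern."""
    return sum(sum(row) for row in sparse_attention_mask(n, num_global, window))


if __name__ == "__main__":
    # Full attention keeps n * n pairs; the sparse pattern grows linearly in n.
    for n in (64, 128, 256):
        print(n, count_attended_pairs(n), n * n)
```

The dense mask is built here only so the pairs can be counted; an efficient implementation would never materialize the full n x n matrix. With a fixed number of global tokens and a fixed window, doubling the sequence length roughly doubles the number of kept pairs, whereas full attention quadruples it.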