Pre-trained Transformers are now ubiquitous in natural language processing, but despite their high end-task performance, little is known empirically about whether they are calibrated. Specifically, do these models' posterior probabilities provide an accurate empirical measure of how likely the model is to be correct on a given example? We focus on BERT and RoBERTa in this work, and analyze their calibration across three tasks: natural language inference, paraphrase detection, and commonsense reasoning. For each task, we consider in-domain as well as challenging out-of-domain settings, where models face more examples they should be uncertain about. We show that: (1) when used out-of-the-box, pre-trained models are calibrated in-domain, and compared to baselines, their calibration error out-of-domain can be as much as 3.5x lower; (2) temperature scaling is effective at further reducing calibration error in-domain, and using label smoothing to deliberately increase empirical uncertainty helps calibrate posteriors out-of-domain.
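
To make the quantities named above concrete, the following is a minimal sketch, not the implementation used in this work. It assumes calibration error is measured as expected calibration error over equal-width confidence bins, that temperature scaling divides the logits by a scalar before the softmax, and that label smoothing follows one common formulation that spreads a mass `alpha` evenly over the non-gold classes. All function names, the bin count, and the default `alpha` are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a scalar temperature (assumed > 0)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins.

    confidences: predicted probability of the chosen label, one per example.
    correct: booleans, True where the prediction matched the gold label.
    n_bins is an illustrative choice, not a value taken from the text.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def smooth_labels(gold_index, num_classes, alpha=0.1):
    """Soft target: 1 - alpha on the gold class, alpha spread over the rest.

    This is one common variant of label smoothing, assumed here for illustration.
    """
    off_value = alpha / (num_classes - 1)
    target = [off_value] * num_classes
    target[gold_index] = 1.0 - alpha
    return target
```

Under these assumptions, a temperature above 1 flattens the posterior and so lowers the confidence of the top prediction. In practice the temperature would be a single scalar tuned on held-out data.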