Contextual word representations derived from large-scale neural language models are successful across a diverse set of NLP tasks, suggesting that they encode useful and transferable features of language. To shed light on the linguistic knowledge they capture, we study the representations produced by several recent pretrained contextualizers (variants of ELMo, the OpenAI transformer language model, and BERT) with a suite of sixteen diverse probing tasks. We find that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge (e.g., conjunct identification). To investigate the transferability of contextual word representations, we quantify differences in the transferability of individual layers within contextualizers, especially between recurrent neural networks (RNNs) and transformers. For instance, higher layers of RNNs are more task-specific, while transformer layers do not exhibit the same monotonic trend. In addition, to better understand what makes contextual word representations transferable, we compare language model pretraining with eleven supervised pretraining tasks. For any given task, pretraining on a closely related task yields better performance than language model pretraining (which is better on average) when the pretraining dataset is fixed. However, language model pretraining on more data gives the best results.