Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task. As an alternative, we propose transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing. To demonstrate the effectiveness of adapters, we transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, we attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task. By contrast, fine-tuning trains 100% of the parameters per task.
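
The parameter accounting above can be illustrated with a short calculation. This is a minimal sketch, not the authors' code: the function names are hypothetical, and the choice of 9 tasks is an illustrative assumption. The only figures taken from the text are the 3.6% of parameters added per task by adapters and the 100% trained per task by fine-tuning. The sketch assumes that fine-tuning stores one full model copy per task, whereas adapters store a single shared, frozen base model plus a small per-task addition.

```python
def fine_tuning_total(num_tasks: int) -> float:
    """Total stored parameters, as a multiple of one base model,
    assuming every task keeps its own fully fine-tuned copy."""
    return float(num_tasks)


def adapter_total(num_tasks: int, adapter_fraction: float) -> float:
    """Total stored parameters, as a multiple of one base model,
    assuming one shared frozen base model plus `adapter_fraction`
    of its size in new parameters for each task."""
    return 1.0 + num_tasks * adapter_fraction


if __name__ == "__main__":
    num_tasks = 9             # illustrative assumption, not from the text
    adapter_fraction = 0.036  # 3.6% added parameters per task (from the text)

    print(f"fine-tuning: {fine_tuning_total(num_tasks):.2f}x base model")
    print(f"adapters:    {adapter_total(num_tasks, adapter_fraction):.2f}x base model")
```

Under these assumptions, the fine-tuning total grows by one full model per task, while the adapter total grows by only 0.036 of a model per task, which is what makes adding new tasks cheap.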