Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

Abstract

There remain many open questions pertaining to the scaling behaviour of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost which has both financial and environmental impact. The goal of this paper is to present scaling insights from pretraining and finetuning Transformers. While Kaplan et al. present a comprehensive study of the scaling behaviour of Transformer language models, the scope is only on the upstream (pretraining) loss. Therefore, it is still unclear whether this set of findings transfers to downstream tasks within the context of the pretrain-finetune paradigm. The key findings of this paper are as follows: (1) we show that aside from only the model size, model shape matters for downstream fine-tuning, (2) scaling protocols operate differently at different compute regions, (3) widely adopted T5-base and T5-large sizes are Pareto-inefficient. To this end, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50% fewer parameters and training 40% faster compared to the widely adopted T5-base model. We publicly release over 100 pretrained checkpoints of different T5 configurations to facilitate future research and analysis.
