This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning (finetuning language models on a collection of tasks described via instructions) substantially boosts zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 19 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of tasks and model scale are key components to the success of instruction tuning.
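
To make "verbalized via natural language instruction templates" concrete, the following is a minimal sketch of how a labeled example might be turned into an instruction-style text pair for finetuning. The template wording, the choice of an entailment task, and the names `NLI_TEMPLATES`, `verbalize`, and `build_finetuning_set` are all illustrative assumptions, not the templates or code used for FLAN.

```python
import random

# Hypothetical instruction templates for an entailment task.
# The wording is illustrative only and is not taken from the paper.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes or no.",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"? Answer yes or no.",
]


def verbalize(example, templates, rng):
    """Turn one labeled example into an (input, target) text pair."""
    template = rng.choice(templates)  # vary the phrasing across examples
    return {
        "input": template.format(
            premise=example["premise"], hypothesis=example["hypothesis"]
        ),
        "target": example["label"],
    }


def build_finetuning_set(examples, templates, seed=0):
    """Verbalize every example with a randomly chosen template."""
    rng = random.Random(seed)
    return [verbalize(ex, templates, rng) for ex in examples]


examples = [
    {"premise": "A dog is running in the park.",
     "hypothesis": "An animal is outdoors.", "label": "yes"},
    {"premise": "The store is closed on Sundays.",
     "hypothesis": "The store is open every day.", "label": "no"},
]
pairs = build_finetuning_set(examples, NLI_TEMPLATES)
```

Under this sketch, instruction tuning would amount to ordinary supervised finetuning on such pairs pooled across many tasks, with evaluation on task types that were held out of the pool.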