Enhancing Training Efficiency Using Packing with Flash Attention

Abstract

Padding is often used when tuning LLMs: special tokens are added to shorter training examples so that they match the length of the longest sequence in each batch. While this ensures uniformity for batch processing, it introduces inefficiencies by including irrelevant padding tokens in the computation, which wastes GPU resources. The Hugging Face SFT trainer, on the other hand, offers the option to use packing to combine multiple training examples up to the maximum sequence length, which allows for maximal utilization of GPU resources. However, without proper masking of each packed training example, attention will not be computed correctly when using the SFT trainer. We enable and then analyse packing and Flash Attention with proper attention masking of each example and show the benefits of this training paradigm.
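The difference between padding and packing, and what "proper masking" has to guarantee, can be illustrated with a small sketch. It is not the paper's implementation: the helper names (pack_examples, block_causal_mask, padded_token_count) are hypothetical, and it assumes one common way of marking example boundaries, namely position ids that restart at zero for each example together with cumulative sequence lengths of the kind that variable-length attention kernels typically consume.

```python
from itertools import accumulate


def pack_examples(examples):
    """Concatenate tokenized examples into one sequence with no padding.

    Returns the packed token ids, position ids that restart at 0 for each
    example, and cumulative sequence lengths marking example boundaries.
    """
    input_ids = [tok for ex in examples for tok in ex]
    position_ids = [pos for ex in examples for pos in range(len(ex))]
    cu_seqlens = [0] + list(accumulate(len(ex) for ex in examples))
    return input_ids, position_ids, cu_seqlens


def block_causal_mask(cu_seqlens):
    """Build a boolean mask where mask[q][k] is True if query token q may
    attend to key token k: same example, and k is not in the future."""
    total = cu_seqlens[-1]
    segment = [0] * total
    for idx, (start, end) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
        for t in range(start, end):
            segment[t] = idx
    return [
        [segment[q] == segment[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


def padded_token_count(examples):
    """Tokens processed when every example is padded to the longest one."""
    return len(examples) * max(len(ex) for ex in examples)


# Three toy "tokenized" examples of different lengths (made-up token ids).
examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
input_ids, position_ids, cu_seqlens = pack_examples(examples)
mask = block_causal_mask(cu_seqlens)

print("padded tokens:", padded_token_count(examples))
print("packed tokens:", len(input_ids))
print("position_ids: ", position_ids)
print("cu_seqlens:   ", cu_seqlens)
```

In this toy batch, padding processes 12 token slots while packing processes 9. The mask is block-diagonal and causal: the first token of the second example cannot attend to the last token of the first example, which is exactly the cross-example leakage that naive packing would allow. A dense mask like this is only for illustration; a fused kernel would normally work from the boundary information directly rather than materialize the full matrix.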
