Transformer tricks: Removing weights for skipless transformers

Abstract

He and Hofmann (arXiv:2311.01906) detailed a skipless transformer without the V and P (post-attention projection) linear layers, which reduces the total number of weights. However, this scheme is applicable only to MHA (multi-head attention), not to MQA (multi-query attention) or GQA (grouped-query attention). The latter schemes are used by many popular LLMs such as Llama 2, Mistral, Mixtral, PaLM, and Gemma. Therefore, this micro-paper proposes mathematically equivalent versions that are suitable for MQA and GQA. For example, removing Q and P from a skipless version of Mistral-7B would remove 15% of its weights (and thus reduce its compute and memory complexity). See arXiv:2402.13388 and https://github.com/OpenMachine-ai/transformer-tricks for code and more transformer tricks.
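The 15% figure can be roughly sanity-checked by counting weights. The sketch below is not taken from the paper's code. It assumes the publicly released Mistral-7B configuration (model dimension 4096, 32 layers, 32 query heads, 8 key/value heads, head dimension 128, FFN hidden dimension 14336, vocabulary size 32000, untied input and output embeddings), none of which is stated in the abstract, and it ignores the small normalization weights. The function name `count_weights` is hypothetical.

```python
# Rough weight count for Mistral-7B and the share held by Q and P.
# All configuration values are assumptions taken from the public
# Mistral-7B release, not from the abstract above.

def count_weights(dim=4096, n_layers=32, n_heads=32, n_kv_heads=8,
                  head_dim=128, ffn_hidden=14336, vocab=32000):
    """Return (total_weights, q_and_p_weights), ignoring norm weights."""
    q = dim * (n_heads * head_dim)       # query projection
    k = dim * (n_kv_heads * head_dim)    # key projection (GQA: fewer heads)
    v = dim * (n_kv_heads * head_dim)    # value projection (GQA: fewer heads)
    p = (n_heads * head_dim) * dim       # post-attention projection
    ffn = 3 * dim * ffn_hidden           # gated FFN: gate, up, and down matrices
    per_layer = q + k + v + p + ffn
    embeddings = 2 * vocab * dim         # untied input and output embeddings
    total = n_layers * per_layer + embeddings
    q_and_p = n_layers * (q + p)
    return total, q_and_p


total, q_and_p = count_weights()
fraction = q_and_p / total
print(f"total weights:   {total:,}")
print(f"Q and P weights: {q_and_p:,}")
print(f"fraction:        {fraction:.1%}")
```

Under these assumptions Q and P together account for a little under 15% of the weights, consistent with the figure quoted above. With GQA the K and V matrices are four times smaller than Q and P here, which suggests why removing Q and P saves more than removing V and P would.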
