Abstract
Powerful foundation models, including large language models (LLMs) with Transformer architectures, have ushered in a new era of Generative AI across various industries. Industry and the research community have witnessed a large number of new applications based on these foundation models, including question answering, customer service, image and video generation, and code completion, among others. However, as the number of model parameters reaches hundreds of billions, deploying these models incurs prohibitive inference costs and high latency in real-world scenarios. As a result, the demand for cost-effective and fast inference using AI accelerators is higher than ever. To this end, our tutorial offers a comprehensive discussion of complementary inference optimization techniques using AI accelerators. Beginning with an overview of basic Transformer architectures and deep learning system frameworks, we dive deep into system optimization techniques for fast and memory-efficient attention computations and discuss how they can be implemented efficiently on AI accelerators. Next, we describe architectural elements that are key for fast Transformer inference. Finally, we examine various model compression and fast decoding strategies in the same context.