Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation

Abstract

Serving Large Language Models (LLMs) is critical for AI-powered applications, yet it demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two and suffer from suboptimal performance because of high-level GPU programming abstractions. These abstractions restrict critical optimizations, such as fine-grained register management and optimized memory access patterns, that are essential for efficient low-precision computations. In this paper, we introduce Tilus, a domain-specific language designed for General-Purpose GPU (GPGPU) computing that supports low-precision data types with arbitrary bit widths from 1 to 8 while maintaining GPU programmability. Tilus features a thread-block-level programming model, a hierarchical memory space, a novel algebraic layout system, and extensive support for diverse low-precision data types. Tilus programs are compiled into highly efficient GPU programs through automatic vectorization and instruction selection. Extensive experiments demonstrate that Tilus efficiently supports a full spectrum of low-precision data types and outperforms state-of-the-art low-precision kernels. Compared to existing compilers such as Triton and Ladder, as well as hand-optimized kernels such as QuantLLM and Marlin, Tilus achieves performance improvements of $1.75\times$, $2.61\times$, $1.29\times$, and $1.03\times$, respectively. We open-source Tilus at https://github.com/NVIDIA/tilus.
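
To make the notion of "arbitrary bit widths from 1 to 8" concrete, the following is a minimal host-side sketch in Python of how values narrower than a byte can be stored contiguously, so that, for example, eight 3-bit weights occupy 3 bytes rather than 8. It is an illustration only and is not taken from Tilus: the function names `pack_bits` and `unpack_bits`, the unsigned encoding, and the little-endian bit order are all assumptions made for this example.

```python
def pack_bits(values, bit_width):
    """Pack unsigned integers of `bit_width` bits each into bytes.

    Hypothetical helper for illustration; bits are laid out little-endian,
    so the first value occupies the lowest bits of the first byte.
    """
    if not 1 <= bit_width <= 8:
        raise ValueError("bit_width must be between 1 and 8")
    acc = 0        # bit accumulator holding bits not yet emitted
    nbits = 0      # number of valid bits currently in the accumulator
    out = bytearray()
    for v in values:
        if not 0 <= v < (1 << bit_width):
            raise ValueError(f"value {v} does not fit in {bit_width} bits")
        acc |= v << nbits
        nbits += bit_width
        # Emit every complete byte that has accumulated.
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        # Flush the final partial byte (upper bits are zero padding).
        out.append(acc & 0xFF)
    return bytes(out)


def unpack_bits(data, bit_width, count):
    """Inverse of `pack_bits`: recover `count` values of `bit_width` bits."""
    if not 1 <= bit_width <= 8:
        raise ValueError("bit_width must be between 1 and 8")
    mask = (1 << bit_width) - 1
    acc = 0
    nbits = 0
    source = iter(data)
    values = []
    for _ in range(count):
        # Pull in bytes until a full value is available.
        while nbits < bit_width:
            acc |= next(source) << nbits
            nbits += 8
        values.append(acc & mask)
        acc >>= bit_width
        nbits -= bit_width
    return values


if __name__ == "__main__":
    weights = [5, 2, 7, 0, 1, 6, 3, 4]          # eight 3-bit values
    packed = pack_bits(weights, 3)
    print(len(packed))                           # number of bytes used
    print(unpack_bits(packed, 3, len(weights)))  # round-trips to `weights`
```

When the bit width is not a power of two, individual values straddle byte boundaries, which is why the unpacking loop has to carry leftover bits from one byte to the next. A GPU kernel faces the same issue at register and memory-transaction granularity, which is the kind of low-level detail the abstract argues high-level abstractions make hard to control.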
