GPU Performance Portability needs Autotuning

Abstract

As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with comprehensive kernel parameter autotuning to enable portable LLM inference with state-of-the-art performance without code changes. Focusing on performance-critical LLM kernels, we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.
