CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

Abstract

The exponential growth in demand for GPU computing resources, driven by the rapid advancement of Large Language Models, has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models (e.g., R1, o1) achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization. CUDA-L1 achieves performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of x17.7 across all 250 CUDA kernels of KernelBench, with peak speedups reaching x449. Furthermore, the model also demonstrates excellent portability across GPU architectures, achieving average speedups of x17.8 on H100, x19.0 on RTX 3090, x16.5 on L40, x14.7 on H800, and x13.9 on H20 despite being optimized specifically for A100. Beyond these benchmark results, CUDA-L1 demonstrates several remarkable properties: 1) it discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) it uncovers fundamental principles of CUDA optimization; 3) it identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that harm performance. The capabilities of CUDA-L1 demonstrate that reinforcement learning can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. More importantly, the trained RL model extends the acquired reasoning abilities to new kernels. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources.
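To make the phrase "speedup-based reward signals" concrete, here is a minimal Python sketch of what such a reward could look like. It is an illustration under stated assumptions, not the reward defined in the paper: the function names `speedup_reward` and `mean_speedup`, the use of median timings, and the zero reward for incorrect kernels are all hypothetical choices made for this example.

```python
from statistics import median


def speedup_reward(baseline_times, candidate_times, is_correct):
    """Hypothetical reward: median baseline runtime / median candidate runtime.

    Assumption (not from the paper): a candidate kernel that fails the
    correctness check earns zero reward, however fast it runs.
    """
    if not is_correct:
        return 0.0
    if not baseline_times or not candidate_times:
        raise ValueError("need at least one timing for each kernel")
    candidate = median(candidate_times)
    if candidate <= 0:
        raise ValueError("timings must be positive")
    # A value above 1.0 means the candidate is faster than the baseline.
    return median(baseline_times) / candidate


def mean_speedup(speedups):
    """Arithmetic mean of per-kernel speedups (one possible way to average)."""
    if not speedups:
        raise ValueError("no speedups given")
    return sum(speedups) / len(speedups)
```

For instance, a candidate that runs in 1.0 ms against a 2.0 ms baseline would receive a reward of 2.0 under this sketch, while an incorrect candidate would receive 0.0.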
