CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

Abstract

The exponential growth in demand for GPU computing resources has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization that employs a novel contrastive RL algorithm. CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of x3.12 with a median speedup of x1.42 across all 250 CUDA kernels of KernelBench, with peak speedups reaching x120. Furthermore, the model also demonstrates portability across GPU architectures, achieving average speedups of x3.12 on L40, x2.50 on RTX 3090, x2.39 on H100, and x2.37 on H20 despite being optimized specifically for A100. The capabilities of CUDA-L1 demonstrate that RL can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially improve GPU efficiency and alleviate the rising pressure on GPU computing resources. We also identify important challenges posed by training RL models for tasks like CUDA development, where RL often learns to exploit loopholes in reward functions rather than solve the intended optimization problems. By identifying these failure modes and analyzing their root causes, we develop practical methods for creating more robust training procedures that prevent reward hacking.
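
The gap between the average (x3.12) and median (x1.42) speedups reported above is what one would expect when a few kernels gain very large speedups. The following is a minimal sketch, assuming speedup is defined as reference runtime divided by optimized runtime. The abstract does not spell out this definition, and the helper names `speedup` and `summarize` and all timings are hypothetical, made up for illustration; none of this is the paper's evaluation code or data.

```python
from statistics import mean, median


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Assumed definition: reference runtime / candidate runtime (>1 means faster)."""
    if baseline_ms <= 0 or candidate_ms <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_ms / candidate_ms


def summarize(speedups: list[float]) -> dict[str, float]:
    """Aggregate per-kernel speedups the way the abstract reports them."""
    return {
        "mean": mean(speedups),
        "median": median(speedups),
        "max": max(speedups),
    }


# Made-up (baseline_ms, candidate_ms) pairs for five hypothetical kernels.
# These are NOT measurements from the paper.
timings = [(10.0, 8.0), (4.0, 4.0), (6.0, 4.0), (9.0, 6.0), (50.0, 1.0)]

speedups = [speedup(b, c) for b, c in timings]
stats = summarize(speedups)
print(stats)
```

In this toy data a single x50 outlier pulls the mean far above the median, which is why reporting both figures gives a fuller picture than either alone.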
