Learning to Optimize Tensor Programs

Abstract

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high-dimensional convolution, are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries such as cuDNN where only a narrow range of server-class GPUs are well-supported. The reliance on hardware-specific operator libraries limits the applicability of high-level graph optimizations and incurs significant engineering costs when deploying to new hardware targets. We use learning to remove this engineering burden. We learn domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search by effective model transfer across workloads. Experimental results show that our framework delivers performance competitive with state-of-the-art hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPU.
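The idea of using a learned cost model to guide search over program variants can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the search space (`TILE_SIZES`), the synthetic `measure` function standing in for a hardware run, and the nearest-neighbour predictor standing in for a statistical cost model. The abstract does not specify the actual model or search procedure, so this is not the authors' method.

```python
import itertools
import math
import random

# Hypothetical search space: tile sizes for a 2-D loop nest.
TILE_SIZES = [1, 2, 4, 8, 16, 32, 64]
SPACE = list(itertools.product(TILE_SIZES, TILE_SIZES))


def measure(config):
    """Stand-in for running a program variant on hardware (synthetic cost)."""
    tx, ty = config
    # Made-up cost surface with its minimum at (16, 8); not real timing data.
    return (math.log2(tx) - 4) ** 2 + (math.log2(ty) - 3) ** 2 + 1.0


def features(config):
    """Map a configuration to a numeric feature vector."""
    return [math.log2(v) for v in config]


def predict(config, history, k=3):
    """Toy cost model: mean cost of the k nearest measured configurations."""
    f = features(config)
    nearest = sorted(history, key=lambda item: math.dist(f, features(item[0])))[:k]
    return sum(cost for _, cost in nearest) / len(nearest)


def guided_search(rounds=5, batch=4, seed=0):
    """Alternate between ranking variants with the model and measuring the best."""
    rng = random.Random(seed)
    # Bootstrap the model with a few random measurements.
    history = [(c, measure(c)) for c in rng.sample(SPACE, batch)]
    for _ in range(rounds):
        measured = {c for c, _ in history}
        unmeasured = [c for c in SPACE if c not in measured]
        # Rank unmeasured variants by predicted cost and measure only the
        # most promising ones, so expensive measurements are spent sparingly.
        unmeasured.sort(key=lambda c: predict(c, history))
        history += [(c, measure(c)) for c in unmeasured[:batch]]
    # Return the best (config, cost) pair among everything measured.
    return min(history, key=lambda item: item[1])
```

Calling `guided_search()` returns the lowest-cost configuration among those measured, together with its cost. The point of the design is that each round measures only `batch` variants chosen by the model rather than the whole space; with billions of variants, as in the setting described above, exhaustive measurement would not be feasible.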
