Abstract
Efficiently tiling and mapping high-dimensional convolutions onto limited execution and buffering resources is a challenge faced by all deep learning accelerators today. We term each unique approach a dataflow. The dataflow determines overall throughput (utilization of the compute units) and energy efficiency (reads, writes, and reuse of model parameters and partial sums across the accelerator's memory hierarchy). In this work, we provide a first-of-its-kind framework called MAESTRO to formally describe and analyze CNN dataflows. MAESTRO uses a set of concise pragmas to describe three kinds of data reuse: spatial, temporal, and spatio-temporal. It predicts the roofline performance and energy efficiency of each dataflow when running neural network layers, and reports the hardware resources (size of buffers across the memory hierarchy, and network-on-chip (NoC) bandwidth) required to support this dataflow. Using MAESTRO, we demonstrate trade-offs between various dataflows, and show the potential benefits of a hardware substrate with a specialized NoC that can support adaptive dataflows.
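To make the notion of roofline performance concrete, the following is a minimal sketch of the generic roofline bound, not MAESTRO's actual cost model. It assumes throughput is limited either by the number of compute units or by how fast the NoC can deliver operands. The function name, parameter names, and the numbers used are hypothetical and chosen only for illustration.

```python
def roofline_throughput(peak_macs_per_cycle, noc_elems_per_cycle, macs_per_elem):
    """Attainable MACs/cycle under a simple roofline bound.

    peak_macs_per_cycle: compute roof (e.g., number of processing elements).
    noc_elems_per_cycle: data elements the NoC can deliver per cycle.
    macs_per_elem: reuse achieved by the dataflow, i.e., MACs performed
        per element delivered (an assumed, illustrative metric).
    """
    bandwidth_roof = noc_elems_per_cycle * macs_per_elem
    return min(peak_macs_per_cycle, bandwidth_roof)


# Hypothetical accelerator: 256 MACs/cycle peak, NoC delivering 16 elements/cycle.
low_reuse = roofline_throughput(256, 16, 4)    # bandwidth-bound: 16 * 4 = 64
high_reuse = roofline_throughput(256, 16, 32)  # compute-bound: capped at 256

print(low_reuse, high_reuse)
```

Under these assumptions, a dataflow with little reuse leaves most compute units idle because the NoC cannot feed them, while a dataflow with more reuse reaches the compute roof on the same hardware.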