Abstract
AI ranges from large language models to tiny models running on microcontrollers (MCUs). Extremely memory-efficient model architectures are essential to fit within an MCU's tiny memory budget, e.g., 128 kB of RAM. However, inference latency must remain low to meet real-time constraints. One approach to this problem is patch-based fusion, which aims to optimize data flows across neural network layers. In this paper, we introduce msf-CNN, a novel technique that efficiently finds optimal fusion settings for convolutional neural networks (CNNs) by walking through the fusion solution space represented as a directed acyclic graph. Compared to previous work on CNN fusion for MCUs, msf-CNN identifies a wider set of solutions. We published an implementation of msf-CNN running on various microcontrollers (ARM Cortex-M, RISC-V, ESP32). We show that msf-CNN can achieve inference using 50% less RAM compared to the prior art (MCUNetV2 and StreamNet). We thus demonstrate how msf-CNN offers additional flexibility for system designers.
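
To make the graph view of the fusion search space concrete, the following is a minimal illustrative sketch and not the msf-CNN implementation. It assumes that nodes are layer boundaries 0..n, that an edge (i, j) means "execute layers i to j-1 as one block" (fused when j > i + 1), and that each edge carries a peak-RAM cost. The function name `min_peak_ram_path` and all cost values are hypothetical. Under these assumptions, a fusion setting is a path from node 0 to node n, and the setting with the lowest peak RAM is the path whose largest edge cost is smallest.

```python
import math
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (start boundary, end boundary), start < end


def min_peak_ram_path(num_layers: int, ram: Dict[Edge, int]) -> Tuple[int, List[Edge]]:
    """Return (peak RAM, list of blocks) for the path 0 -> num_layers
    that minimizes the maximum edge cost (a minimax path on a DAG)."""
    best = [math.inf] * (num_layers + 1)  # best[j]: lowest peak RAM to reach node j
    prev = [None] * (num_layers + 1)      # predecessor of j on that best path
    best[0] = 0

    # Edges only go from lower to higher indices, so increasing j
    # is already a topological order.
    for j in range(1, num_layers + 1):
        for i in range(j):
            if (i, j) in ram and best[i] != math.inf:
                candidate = max(best[i], ram[(i, j)])
                if candidate < best[j]:
                    best[j] = candidate
                    prev[j] = i

    if best[num_layers] == math.inf:
        raise ValueError("no path from input to output in the fusion graph")

    # Walk predecessors back from the output to recover the chosen blocks.
    blocks: List[Edge] = []
    node = num_layers
    while node != 0:
        blocks.append((prev[node], node))
        node = prev[node]
    blocks.reverse()
    return best[num_layers], blocks


# Hypothetical 4-layer network; RAM costs in bytes are made up for illustration.
example_ram = {
    (0, 1): 96_000, (1, 2): 64_000, (2, 3): 32_000, (3, 4): 16_000,  # unfused layers
    (0, 2): 40_000, (0, 3): 24_000, (1, 3): 48_000,                  # fused blocks
}
peak, blocks = min_peak_ram_path(4, example_ram)
print(peak, blocks)  # 24000 [(0, 3), (3, 4)]
```

In this toy example the lowest-RAM setting fuses the first three layers into one block and runs the last layer on its own. A realistic search would also attach a compute or latency cost to each edge and trade it off against RAM, which this sketch omits.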