Abstract
Real-time object detection is a fundamental but challenging task in computer vision, particularly when computational resources are limited. Although YOLO-series models have set strong benchmarks by balancing speed and accuracy, the increasing need for richer global context modeling has led to the use of Transformer-based architectures. Nevertheless, Transformers have high computational complexity because of their self-attention mechanism, which limits their practicality for real-time and edge deployments. To overcome these challenges, recent developments in linear state space models, such as Mamba, provide a promising alternative by enabling efficient sequence modeling with linear complexity. Building on this insight, we propose MambaNeXt-YOLO, a novel object detection framework that balances accuracy and efficiency through three key contributions: (1) MambaNeXt Block: a hybrid design that integrates CNNs with Mamba to effectively capture both local features and long-range dependencies; (2) Multi-branch Asymmetric Fusion Pyramid Network (MAFPN): an enhanced feature pyramid architecture that improves multi-scale object detection across various object sizes; and (3) Edge-focused Efficiency: our method achieved 66.6% mAP at 31.9 FPS on the PASCAL VOC dataset without any pre-training and supports deployment on edge devices such as the NVIDIA Jetson Xavier NX and Orin NX.
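The linear-complexity claim for state space models can be illustrated with a minimal sketch. It assumes a scalar, fixed-parameter recurrence and is not Mamba's selective scan, in which the parameters depend on the input. The names `ssm_scan`, `scan_step_count`, and `attention_pair_count` are hypothetical and chosen only for illustration.

```python
from typing import List


def ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Run a scalar linear state space recurrence over a sequence.

    h[t] = a * h[t-1] + b * x[t]
    y[t] = c * h[t]

    Assumption: a, b, c are fixed scalars. Mamba instead makes its
    parameters input-dependent, which this sketch does not model.
    """
    h = 0.0  # hidden state, carried from one step to the next
    y = []
    for x_t in x:  # one constant-cost update per token: O(n) overall
        h = a * h + b * x_t
        y.append(c * h)
    return y


def scan_step_count(n: int) -> int:
    """Number of state updates the recurrence performs for n tokens."""
    return n


def attention_pair_count(n: int) -> int:
    """Number of token pairs full self-attention scores for n tokens."""
    return n * n


if __name__ == "__main__":
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0))
    print(scan_step_count(1024), attention_pair_count(1024))
```

Under these assumptions, doubling the sequence length doubles the work of the scan but quadruples the number of pairwise scores in full self-attention, which is the gap the abstract refers to.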