Res-VMamba: Fine-Grained Food Category Visual Classification Using Selective State Space Models with Deep Residual Learning

Abstract

Food classification is the foundation for developing food vision tasks and plays a key role in the burgeoning field of computational nutrition. Because food images require fine-grained classification, recent academic research mainly modifies Convolutional Neural Networks (CNNs) and/or Vision Transformers (ViTs) to perform food category classification. However, to learn fine-grained features, the CNN backbone needs additional structural design, whereas ViT, containing the self-attention module, has increased computational complexity. In recent months, a new Sequence State Space (S4) model, through a Selection mechanism and computation with a Scan (S6), colloquially termed Mamba, has demonstrated superior performance and computational efficiency compared to the Transformer architecture. The VMamba model, which incorporates the Mamba mechanism into image tasks (such as classification), currently establishes the state-of-the-art (SOTA) on the ImageNet dataset. In this research, we introduce an academically underestimated food dataset, CNFOOD-241, and pioneer the integration of a residual learning framework within the VMamba model to concurrently harness both the global and local state features inherent in the original VMamba architectural design. The research results show that VMamba surpasses current SOTA models in fine-grained and food classification. The proposed Res-VMamba further improves the classification accuracy to 79.54% without pretrained weights. Our findings show that the proposed methodology establishes a new benchmark for SOTA performance in food recognition on the CNFOOD-241 dataset. The code can be obtained on GitHub: https://github.com/ChiShengChen/ResVMamba.
