Abstract
Food recognition plays an important role in food choice and intake, which is essential to the health and well-being of humans. It is thus of importance to the computer vision community, and can further support many food-oriented vision and multimodal tasks. Unfortunately, while we have witnessed remarkable advancements in generic visual recognition driven by released large-scale datasets, progress in the food domain largely lags behind. In this paper, we introduce Food2K, which is the largest food recognition dataset with 2,000 categories and over 1 million images. Compared with existing food recognition datasets, Food2K surpasses them in both categories and images by one order of magnitude, and thus establishes a new challenging benchmark to develop advanced models for food visual representation learning. Furthermore, we propose a deep progressive region enhancement network for food recognition, which mainly consists of two components, namely progressive local feature learning and region feature enhancement. The former adopts improved progressive training to learn diverse and complementary local features, while the latter utilizes self-attention to incorporate richer multi-scale context into local features for further local feature enhancement. Extensive experiments on Food2K demonstrate the effectiveness of our proposed method. More importantly, we have verified that Food2K offers better generalization ability in various tasks, including food recognition, food image retrieval, cross-modal recipe retrieval, food detection and segmentation. Food2K can be further explored to benefit more food-relevant tasks, including emerging and more complex ones (e.g., nutritional understanding of food), and models trained on Food2K can be expected to serve as backbones that improve performance on more food-relevant tasks. We also hope Food2K can serve as a large-scale fine-grained visual recognition benchmark.