Abstract
Given the potential applications of generating recipes from food images, thisarea has garnered significant attention from researchers in recent years.Existing works for recipe generation primarily utilize a two-stage trainingmethod, first generating ingredients and then obtaining instructions from boththe image and ingredients. Large Multi-modal Models (LMMs), which have achievednotable success across a variety of vision and language tasks, shed light togenerating both ingredients and instructions directly from images.Nevertheless, LMMs still face the common issue of hallucinations during recipegeneration, leading to suboptimal performance. To tackle this, we propose aretrieval augmented large multimodal model for recipe generation. We firstintroduce Stochastic Diversified Retrieval Augmentation (SDRA) to retrieverecipes semantically related to the image from an existing datastore as asupplement, integrating them into the prompt to add diverse and rich context tothe input image. Additionally, Self-Consistency Ensemble Voting mechanism isproposed to determine the most confident prediction recipes as the finaloutput. It calculates the consistency among generated recipe candidates, whichuse different retrieval recipes as context for generation. Extensiveexperiments validate the effectiveness of our proposed method, whichdemonstrates state-of-the-art (SOTA) performance in recipe generation tasks onthe Recipe1M dataset.