Abstract
Hallucination is a big shadow hanging over the rapidly evolving Multimodal Large Language Models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. In order to mitigate hallucinations, existing studies mainly resort to an instruction-tuning manner that requires retraining the models with specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Like a woodpecker heals trees, it picks out and corrects hallucinations from the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs, while being interpretable by accessing intermediate outputs of the five stages. We evaluate Woodpecker both quantitatively and qualitatively and show the huge potential of this new paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker.
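To make the five-stage structure concrete, the following is a minimal sketch of how such a post-hoc correction pipeline could be organized. It is not the released Woodpecker code: every function name, the `Trace` container, and the `vocabulary` and `detected_objects` inputs are hypothetical, and each stage is replaced by a toy rule (word matching and a set of detected labels) where an actual system would presumably call language and vision models. The sketch only illustrates the control flow and how keeping each stage's output makes the correction inspectable.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Intermediate outputs of the five stages, kept for inspection."""
    concepts: list = field(default_factory=list)
    questions: list = field(default_factory=list)
    answers: dict = field(default_factory=dict)
    claims: list = field(default_factory=list)
    corrected: str = ""


def extract_key_concepts(text, vocabulary):
    # Stage 1 (toy assumption): keep vocabulary words that occur in the text.
    words = {w.strip(".,").lower() for w in text.split()}
    return sorted(c for c in vocabulary if c in words)


def formulate_questions(concepts):
    # Stage 2 (toy assumption): one existence question per concept.
    return [f"Is there a {c} in the image?" for c in concepts]


def validate(concepts, detected_objects):
    # Stage 3 (toy assumption): a set of detected labels stands in for
    # whatever visual model would answer the questions.
    return {c: c in detected_objects for c in concepts}


def generate_claims(answers):
    # Stage 4 (toy assumption): turn each answer into a short claim.
    return [f"{c}: {'present' if ok else 'absent'}" for c, ok in answers.items()]


def correct(text, answers):
    # Stage 5 (toy assumption): drop sentences mentioning a refuted concept.
    refuted = {c for c, ok in answers.items() if not ok}
    kept = []
    for sentence in (part.strip() for part in text.split(".")):
        if not sentence:
            continue
        tokens = {w.strip(",").lower() for w in sentence.split()}
        if not (tokens & refuted):
            kept.append(sentence)
    return (". ".join(kept) + ".") if kept else ""


def run_pipeline(text, detected_objects, vocabulary):
    # Run the stages in order and record every intermediate result.
    trace = Trace()
    trace.concepts = extract_key_concepts(text, vocabulary)
    trace.questions = formulate_questions(trace.concepts)
    trace.answers = validate(trace.concepts, detected_objects)
    trace.claims = generate_claims(trace.answers)
    trace.corrected = correct(text, trace.answers)
    return trace


# Hypothetical example: the generated text mentions a frisbee that the
# (stand-in) visual evidence does not support.
trace = run_pipeline(
    "A dog sits on the grass. A frisbee lies nearby.",
    detected_objects={"dog", "grass"},
    vocabulary={"dog", "frisbee", "grass"},
)
print(trace.claims)
print(trace.corrected)
```

Because the pipeline operates only on the generated text plus visual evidence, the same wrapper could in principle sit behind any MLLM without retraining, which is the sense in which the approach is described as training-free and post-remedy.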