Abstract
Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.