Abstract
Achieving human-like reasoning capabilities in Multimodal Large Language Models (MLLMs) has long been a goal. Current methods primarily focus on synthesizing positive rationales, typically relying on manual annotations or complex systems. Moreover, they often overlook negative reasoning, which limits the model's generalization ability and robustness in multimodal inference. To address this gap, we propose a novel framework: \textbf{S}elf-Aligning \textbf{M}ultimodal Reasoning with \textbf{A}nswer-O\textbf{r}iented Chain-of-\textbf{T}hought (SMART). SMART employs an answer-oriented chain-of-thought (AoT) prompt to automatically construct high-quality data. Drawing inspiration from human proof-based strategies, AoT leverages both correct and incorrect answers to extract key visual information that links questions and answers. When provided with correct answers, the model produces strong positive rationales. Conversely, when correct answers are replaced with incorrect alternatives, the model generates an erroneous yet compelling reasoning path, serving as a form of discriminative negative rationale. Models trained with AoT-generated data outperform those trained on manually annotated datasets, demonstrating superior reasoning capabilities. Consequently, SMART establishes an iterative generation-optimization method that continually enhances the model's reasoning skills. Experiments indicate that the SMART framework significantly improves various MLLMs, regardless of model architecture, parameter size, or pre-training dataset. The code is available at https://github.com/WentaoTan/SMART.
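
The AoT data-construction step described above can be illustrated with a minimal sketch. It is not taken from the SMART repository: the prompt wording in `AOT_TEMPLATE`, the `RationalePair` container, and the `generate` callable (a stand-in for an MLLM call that already has access to the image) are all hypothetical names introduced for illustration. The sketch only shows the idea that the same prompt, conditioned once on the correct answer and once on an incorrect one, yields a positive and a negative rationale for the same question.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt wording: the abstract does not give the actual AoT template.
AOT_TEMPLATE = (
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Describe the key visual information in the image that links the question "
    "to this answer, then explain the reasoning step by step."
)


@dataclass
class RationalePair:
    """One training example: rationales for the same question (illustrative type)."""
    question: str
    positive: str  # rationale generated when conditioned on the correct answer
    negative: str  # rationale generated when conditioned on an incorrect answer


def build_aot_prompt(question: str, answer: str) -> str:
    """Fill the answer-oriented template with a question and a given answer."""
    return AOT_TEMPLATE.format(question=question, answer=answer)


def build_rationale_pair(
    question: str,
    correct: str,
    incorrect: str,
    generate: Callable[[str], str],
) -> RationalePair:
    """Query the model twice: once with the correct answer, once with a wrong one.

    `generate` is an assumed stand-in for the MLLM; it maps a text prompt to a
    rationale and is expected to handle the image input itself.
    """
    if correct == incorrect:
        raise ValueError("the incorrect answer must differ from the correct one")
    positive = generate(build_aot_prompt(question, correct))
    negative = generate(build_aot_prompt(question, incorrect))
    return RationalePair(question=question, positive=positive, negative=negative)
```

In use, `generate` would wrap whatever model is being aligned, and the resulting pairs would feed the optimization stage; how that stage consumes them is not specified in the abstract and is left out of the sketch.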