Controlling Multimodal LLMs via Reward-guided Decoding

Abstract

As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM's decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model's output. Our approach enables on-the-fly controllability of an MLLM's inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision for recall in image captioning tasks; second, by giving control over the breadth of the search during decoding, allowing the user to control the trade-off between the amount of test-time compute and the degree of visual grounding. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference, while consistently outperforming existing hallucination mitigation methods.
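
The two control knobs described in the abstract can be illustrated with a minimal toy sketch. This is not the paper's implementation: the abstract does not say how the two rewards are combined or how the search is run, so the linear mix with weight `w`, the best-of-`k` selection at each step, and every name below (`combined_reward`, `pick_best`, `guided_decode`, the `toy_*` helpers, `IMAGE_OBJECTS`, `VOCAB`) are assumptions made purely for illustration. Simple word-matching functions stand in for the learned reward models, and a fixed list of continuations stands in for sampling from an MLLM.

```python
from typing import Callable, List, Set

# Toy stand-ins (hypothetical): objects actually in the image, and the
# object words that the toy scorers recognise in a caption.
IMAGE_OBJECTS: Set[str] = {"dog", "ball"}
VOCAB: Set[str] = {"dog", "ball", "cat", "frisbee"}


def mentioned(text: str) -> Set[str]:
    """Return the object words from VOCAB that appear in the text."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return {w for w in words if w in VOCAB}


def toy_precision(text: str) -> float:
    """Fraction of mentioned objects that are really in the image."""
    m = mentioned(text)
    return 1.0 if not m else len(m & IMAGE_OBJECTS) / len(m)


def toy_recall(text: str) -> float:
    """Fraction of image objects that the text mentions."""
    return len(mentioned(text) & IMAGE_OBJECTS) / len(IMAGE_OBJECTS)


def combined_reward(r_precision: float, r_recall: float, w: float) -> float:
    """Assumed linear mix: w=1 favours precision only, w=0 recall only."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be in [0, 1]")
    return w * r_precision + (1.0 - w) * r_recall


def pick_best(candidates: List[str],
              precision_fn: Callable[[str], float],
              recall_fn: Callable[[str], float],
              w: float) -> str:
    """Return the candidate with the highest combined reward."""
    return max(
        candidates,
        key=lambda c: combined_reward(precision_fn(c), recall_fn(c), w),
    )


def guided_decode(propose: Callable[[str, int], List[str]],
                  precision_fn: Callable[[str], float],
                  recall_fn: Callable[[str], float],
                  w: float, k: int, steps: int) -> str:
    """At each step, score up to k proposed continuations and keep the best.

    A larger k means a broader search and more compute per step.
    """
    prefix = ""
    for _ in range(steps):
        candidates = propose(prefix, k)
        if not candidates:
            break
        prefix = pick_best(candidates, precision_fn, recall_fn, w)
    return prefix


def toy_propose(prefix: str, k: int) -> List[str]:
    """Hypothetical proposer: a fixed list instead of sampling from an MLLM."""
    options = [prefix + " dog", prefix + " cat", prefix + " ball"]
    return options[:k]
```

In this toy setting, raising `w` makes `pick_best` prefer a short caption that mentions only objects present in the image, while lowering it favours a longer caption that covers more of them at the risk of a hallucinated object. Raising `k` lets `guided_decode` consider more continuations per step, trading extra compute for a better-grounded result.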
