R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

Abstract

In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the large language models (LLMs)' powerful reasoning capabilities, deterring LMMs' performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method, "Re-Routing in Test-Time" (R2-T2), that locally optimizes the vector of routing weights in test-time by moving it toward those vectors of the correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and greatly improves state-of-the-art LMMs' performance on challenging benchmarks of diverse tasks, without training any base-model parameters.
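
The core idea in the abstract, moving a test sample's routing weights toward those of correctly predicted neighbors, can be illustrated with a minimal sketch. This is not the paper's implementation and does not correspond to any of its three strategies: the function name `reroute`, the Euclidean nearest-neighbor search, the plain averaging of neighbor weights, and the parameters `k` and `step` are all assumptions made for illustration. It also assumes that each sample has an embedding and a routing-weight vector that sums to 1.

```python
import numpy as np

def reroute(test_embedding, test_weights, ref_embeddings, ref_weights,
            ref_correct, k=3, step=0.5):
    """Hypothetical sketch: nudge routing weights toward those of nearby
    reference samples that the model predicted correctly."""
    test_embedding = np.asarray(test_embedding, dtype=float)
    test_weights = np.asarray(test_weights, dtype=float)
    ref_embeddings = np.asarray(ref_embeddings, dtype=float)
    ref_weights = np.asarray(ref_weights, dtype=float)
    ref_correct = np.asarray(ref_correct, dtype=bool)

    # Keep only reference samples with correct predictions.
    emb = ref_embeddings[ref_correct]
    wts = ref_weights[ref_correct]
    if len(emb) == 0:
        return test_weights  # no neighbors to move toward

    # Assumed neighborhood: k nearest by Euclidean distance in embedding space.
    dists = np.linalg.norm(emb - test_embedding, axis=1)
    nearest = np.argsort(dists)[:k]

    # Assumed target: unweighted mean of the neighbors' routing weights.
    target = wts[nearest].mean(axis=0)

    # Move part of the way toward the target, then renormalize to sum to 1.
    new_weights = (1.0 - step) * test_weights + step * target
    return new_weights / new_weights.sum()

if __name__ == "__main__":
    # Toy data: two experts, four reference samples (the last one mispredicted).
    ref_emb = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.1]]
    ref_w = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9], [0.0, 1.0]]
    ref_ok = [True, True, True, False]
    print(reroute([0.0, 0.0], [0.2, 0.8], ref_emb, ref_w, ref_ok, k=2))
```

In this toy setup the router initially favors the second expert, while the two closest correctly predicted neighbors favor the first, so the updated weights shift toward the first expert. No base-model parameters are touched; only the routing-weight vector of the test sample changes.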
