Abstract
Audio-Visual Speech Recognition (AVSR) leverages audio and visual modalities to improve robustness in noisy environments. Recent advances in Large Language Models (LLMs) show strong performance in speech recognition, including AVSR. However, the long speech representations lead to high computational costs for LLMs. Prior methods compress inputs before feeding them to LLMs, but high compression often harms accuracy. To address this, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR, which flexibly adapts audio-visual token allocation under varying compute constraints. Inspired by Matryoshka Representation Learning, our model encodes representations at multiple granularities with a single architecture, avoiding the need for separate models. For efficient fine-tuning, we introduce three LoRA-based strategies using global and scale-specific modules. Evaluations on major AVSR datasets show Llama-MTSK matches or outperforms models trained at fixed compression levels.