Abstract
We introduce MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed for pre-training mixed-modal, early-fusion language models. MoMa processes images and text in arbitrary sequences by dividing expert modules into modality-specific groups. These groups exclusively process designated tokens while employing learned routing within each group to maintain semantically informed adaptivity. Our empirical results reveal substantial pre-training efficiency gains through this modality-specific parameter allocation. Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings: 3.7x overall, with 2.6x for text and 5.2x for image processing compared to a compute-equivalent dense baseline, measured by pre-training loss. This outperforms the standard expert-choice MoE with 8 mixed-modal experts, which achieves 3x overall FLOPs savings (3x for text, 2.8x for image). Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs savings to 4.2x overall (text: 3.4x, image: 5.3x), although this combination hurts performance in causal inference due to increased sensitivity to router accuracy. These results demonstrate MoMa's potential to significantly advance the efficiency of mixed-modal, early-fusion language model pre-training, paving the way for more resource-efficient and capable multimodal AI systems.
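The two-level routing described above (a fixed partition by modality, then learned routing inside each group) can be illustrated with a small sketch. This is not the authors' implementation: the function name `route_modality_aware`, the expert names, the score values, and the per-expert `capacity` are all hypothetical, and the expert-choice style top-k selection inside each group is an assumption made for illustration.

```python
from typing import Dict, List, Sequence


def route_modality_aware(
    modalities: Sequence[str],
    scores: Sequence[Dict[str, float]],
    expert_groups: Dict[str, List[str]],
    capacity: int,
) -> Dict[str, List[int]]:
    """Assign token indices to experts, restricted to each token's modality group.

    modalities[i]: modality label of token i (e.g. "text" or "image")
    scores[i][e]:  hypothetical router score of token i for expert e
    expert_groups: modality -> names of the experts in that group
    capacity:      max tokens each expert may select (assumed expert-choice style)
    """
    assignment: Dict[str, List[int]] = {}
    for modality, experts in expert_groups.items():
        # Level 1: deterministic partition, tokens only reach their own group.
        token_ids = [i for i, m in enumerate(modalities) if m == modality]
        for expert in experts:
            # Level 2: inside the group, each expert keeps its top-scoring tokens.
            ranked = sorted(token_ids, key=lambda i: scores[i][expert], reverse=True)
            assignment[expert] = sorted(ranked[:capacity])
    return assignment


# Toy interleaved sequence: text, image, text, image (all values are made up).
modalities = ["text", "image", "text", "image"]
expert_groups = {"text": ["t0", "t1"], "image": ["i0", "i1"]}
scores = [
    {"t0": 0.9, "t1": 0.1},
    {"i0": 0.2, "i1": 0.8},
    {"t0": 0.3, "t1": 0.7},
    {"i0": 0.6, "i1": 0.4},
]
result = route_modality_aware(modalities, scores, expert_groups, capacity=1)
print(result)
```

In this sketch a text expert never sees an image token and vice versa, while the choice of which tokens each expert handles within its group still depends on the router scores. With expert-choice style selection, a token may be picked by several experts or by none, depending on the scores and the capacity.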