Scaling Laws for Generative Mixed-Modal Language Models

Abstract

Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion parameters, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also identify four empirical phenomena observed during training, such as emergent coordinate-ascent style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models that have unique distributional properties.
