Abstract
Machine learning models based on the aggregated outputs of submodels, eitherat the activation or prediction levels, lead to strong performance. We studythe interplay of two popular classes of such models: ensembles of neuralnetworks and sparse mixture of experts (sparse MoEs). First, we show that thesetwo approaches have complementary features whose combination is beneficial.Then, we present partitioned batch ensembles, an efficient ensemble of sparseMoEs that takes the best of both classes of models. Extensive experiments onfine-tuned vision transformers demonstrate the accuracy, log-likelihood,few-shot learning, robustness, and uncertainty calibration improvements of ourapproach over several challenging baselines. Partitioned batch ensembles notonly scale to models with up to 2.7B parameters, but also provide largerperformance gains for larger models.