Abstract
Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. In order to accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MultiZoo, a public toolkit consisting of standardized implementations of > 20 core multimodal algorithms and MultiBench, a large-scale benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. Together, these provide an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, we offer a comprehensive methodology to assess (1) generalization, (2) time and space complexity, and (3) modality robustness. MultiBench paves the way towards a better understanding of the capabilities and limitations of multimodal models, while ensuring ease of use, accessibility, and reproducibility. Our toolkits are publicly available, will be regularly updated, and welcome inputs from the community.