Abstract
We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement over 70 different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 20 different multi-modal benchmarks. By implementing a single interface, new models can be easily added to the toolkit, while the toolkit automatically handles the remaining workloads, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently mainly used for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research. The toolkit is released at https://github.com/open-compass/VLMEvalKit and is actively maintained.
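The single-interface design described above can be illustrated with a minimal sketch. It is not the toolkit's actual API: the names `BaseModel`, `generate`, `AlwaysA`, `extract_choice`, and `evaluate`, as well as the sample format, are hypothetical and chosen only for illustration. The sketch assumes a multiple-choice benchmark scored by accuracy and omits data preparation and distributed inference.

```python
from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Hypothetical base class: a new model only implements `generate`."""

    @abstractmethod
    def generate(self, image_path: str, question: str) -> str:
        ...


class AlwaysA(BaseModel):
    """Toy stand-in for a real model; it answers 'A' to every question."""

    def generate(self, image_path: str, question: str) -> str:
        return "The answer is A."


def extract_choice(prediction: str, choices: str = "ABCD") -> str:
    """Post-process free-form text into a single option letter ('' if none)."""
    for token in prediction.replace(".", " ").replace(",", " ").split():
        if len(token) == 1 and token in choices:
            return token
    return ""


def evaluate(model: BaseModel, samples: list[dict]) -> float:
    """Run inference, post-process predictions, and return accuracy."""
    if not samples:
        return 0.0
    correct = 0
    for sample in samples:
        raw = model.generate(sample["image"], sample["question"])
        correct += extract_choice(raw) == sample["answer"]
    return correct / len(samples)


# Hypothetical two-sample benchmark; the image paths are never opened here.
samples = [
    {"image": "img_0.jpg", "question": "Which option is a cat?", "answer": "A"},
    {"image": "img_1.jpg", "question": "Which option is a dog?", "answer": "B"},
]
accuracy = evaluate(AlwaysA(), samples)
```

In this sketch, adding a model means writing one subclass with one method, while the shared `evaluate` routine owns post-processing and metric calculation, which mirrors the division of labor the abstract describes.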