MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation

  • 2025-09-02 11:28:27
  • Xian Gao, Jiacheng Ruan, Zongyun Zhang, Jingsheng Gao, Ting Liu, Yuzhuo Fu

Abstract

With the rapid growth of academic publications, peer review has become an essential yet time-consuming responsibility within the research community. Large Language Models (LLMs) have increasingly been adopted to assist in the generation of review comments; however, current LLM-based review tasks lack a unified evaluation benchmark to rigorously assess the models' ability to produce comprehensive, accurate, and human-aligned assessments, particularly in scenarios involving multimodal content such as figures and tables. To address this gap, we propose MMReview, a comprehensive benchmark that spans multiple disciplines and modalities. MMReview includes multimodal content and expert-written review comments for 240 papers across 17 research domains within four major academic disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. We design a total of 13 tasks grouped into four core categories, aimed at evaluating the performance of LLMs and Multimodal LLMs (MLLMs) in step-wise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial input manipulation. Extensive experiments conducted on 16 open-source models and 5 advanced closed-source models demonstrate the thoroughness of the benchmark. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.

 
