LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content

Abstract

The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of scraping data from the web can be the potential sacrifice of the benchmarks on which the abilities of these models are often evaluated. To safeguard against test data contamination and to truly test the abilities of these foundation models, we propose LiveXiv: a scalable, evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer (VQA) pairs. This is done without any human in the loop, using the multi-modal content in the manuscripts, such as graphs, charts, and tables. Moreover, we introduce an efficient evaluation approach that estimates the performance of all models on the evolving benchmark using evaluations of only a subset of models. This significantly reduces the overall evaluation cost. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities while avoiding contamination. Lastly, in our commitment to high quality, we have collected and evaluated a manually verified subset. By comparing its overall results to our automatic annotations, we have found that the performance variance is indeed minimal (<2.5%). Our dataset is available online on HuggingFace, and our code will be available here.
