Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images

  • 2025-04-14 18:30:56
  • Boyang Deng, Songyou Peng, Kyle Genova, Gordon Wetzstein, Noah Snavely, Leonidas Guibas, Thomas Funkhouser

Abstract

We present a system using Multimodal LLMs (MLLMs) to analyze a large database with tens of millions of images captured at different times, with the aim of discovering patterns in temporal changes. Specifically, we aim to capture frequent co-occurring changes ("trends") across a city over a certain period. Unlike previous visual analyses, our analysis answers open-ended queries (e.g., "what are the frequent types of changes in the city?") without any predetermined target subjects or training labels. These properties render prior learning-based or unsupervised visual analysis tools unsuitable. We identify MLLMs as a novel tool for their open-ended semantic understanding capabilities. Yet, our datasets are four orders of magnitude too large for an MLLM to ingest as context. So we introduce a bottom-up procedure that decomposes the massive visual analysis problem into more tractable sub-problems. We carefully design MLLM-based solutions to each sub-problem. During experiments and ablation studies with our system, we find it significantly outperforms baselines and is able to discover interesting trends from images captured in large cities (e.g., "addition of outdoor dining," "overpass was painted blue," etc.). See more results and interactive demos at https://boyangdeng.com/visual-chronicles.

 
