Vision Language Models: A Survey of 26K Papers

Abstract

We present a transparent, reproducible measurement of research trends across 26,104 accepted papers from CVPR, ICLR, and NeurIPS spanning 2023-2025. Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels and mine fine-grained cues about tasks, architectures, training regimes, objectives, datasets, and co-mentioned modalities. The analysis quantifies three macro shifts: (1) a sharp rise of multimodal vision-language-LLM work, which increasingly reframes classic perception as instruction following and multi-step reasoning; (2) steady expansion of generative methods, with diffusion research consolidating around controllability, distillation, and speed; and (3) resilient 3D and video activity, with composition moving from NeRFs to Gaussian splatting and a growing emphasis on human- and agent-centric understanding. Within VLMs, parameter-efficient adaptation like prompting/adapters/LoRA and lightweight vision-language bridges dominate; training practice shifts from building encoders from scratch to instruction tuning and finetuning strong backbones; contrastive objectives recede relative to cross-entropy/ranking and distillation. Cross-venue comparisons show CVPR has a stronger 3D footprint and ICLR the highest VLM share, while reliability themes such as efficiency or robustness diffuse across areas. We release the lexicon and methodology to enable auditing and extension. Limitations include lexicon recall and abstract-only scope, but the longitudinal signals are consistent across venues and years.
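The labeling pipeline summarized above (normalize, protect multiword phrases, match against a lexicon) can be illustrated with a minimal sketch. This is not the released lexicon or the authors' implementation: the `LEXICON` entries, the function names, and the reading of "up to 35 topical labels" as a cap on the labels returned per paper are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical toy lexicon (label -> trigger phrases); the paper's released
# lexicon is not reproduced here.
LEXICON = {
    "vlm": ["vision-language", "vision language model", "multimodal LLM"],
    "diffusion": ["diffusion model", "diffusion"],
    "3d": ["Gaussian splatting", "NeRF"],
}


def normalize(text: str) -> str:
    """Lowercase and collapse every non-alphanumeric run into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def build_phrase_table(lexicon: dict) -> dict:
    """Map each normalized phrase to (single protected token, label)."""
    table = {}
    for label, phrases in lexicon.items():
        for phrase in phrases:
            norm = normalize(phrase)
            table[norm] = (norm.replace(" ", "_"), label)
    return table


def protect(text: str, table: dict) -> list:
    """Fuse lexicon phrases into single tokens, longest phrase first,
    so that a short phrase cannot split a longer one."""
    out = normalize(text)
    for phrase in sorted(table, key=len, reverse=True):
        token = table[phrase][0]
        # Match only whole, whitespace-delimited occurrences of the phrase.
        pattern = r"(?<!\S)" + re.escape(phrase) + r"(?!\S)"
        out = re.sub(pattern, token, out)
    return out.split()


def label_paper(title: str, abstract: str, lexicon: dict, max_labels: int = 35) -> list:
    """Return at most max_labels labels, most frequently triggered first.
    The default of 35 is an assumed reading of the abstract's wording."""
    table = build_phrase_table(lexicon)
    token_to_label = {token: label for token, label in table.values()}
    tokens = protect(f"{title} {abstract}", table)
    counts = Counter(token_to_label[t] for t in tokens if t in token_to_label)
    return [label for label, _ in counts.most_common(max_labels)]


if __name__ == "__main__":
    print(label_paper(
        "Fast Gaussian Splatting",
        "We distill a vision-language model for 3D scenes.",
        LEXICON,
    ))
```

Protecting the longest phrases first is one simple way to keep "vision language model" from being consumed by the shorter "vision language" entry; the actual phrase-protection rule used in the paper may differ.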
