Local Information Matters: Inference Acceleration For Grounded Conversation Generation Models Through Adaptive Local-Aware Token Pruning

  • 2025-04-01 09:34:57
  • Bizhe Bai, Jianjian Cao, Yadan Luo, Tao Chen

Abstract

Grounded Conversation Generation (GCG) is an emerging vision-language task that requires models to generate natural language responses seamlessly intertwined with corresponding object segmentation masks. Recent models, such as GLaMM and OMG-LLaVA, achieve pixel-level grounding but incur significant computational costs due to processing a large number of visual tokens. Existing token pruning methods, like FastV and PyramidDrop, fail to preserve the local visual features critical for accurate grounding, leading to substantial performance drops in GCG tasks. To address this, we propose Adaptive Local-Aware Token Pruning (ALTP), a simple yet effective framework that accelerates GCG models by prioritizing local object information. ALTP introduces two key components: (1) Detail Density Capture (DDC), which uses superpixel segmentation to retain tokens in object-centric regions, preserving fine-grained details, and (2) Dynamic Density Formation (DDF), which dynamically allocates tokens based on information density, ensuring higher retention in semantically rich areas. Extensive experiments on the GranDf dataset demonstrate that ALTP significantly outperforms existing token pruning methods, such as FastV and PyramidDrop, on both GLaMM and OMG-LLaVA models. Notably, when applied to GLaMM, ALTP achieves a 90% reduction in visual tokens with a 4.9% improvement in AP50 and a 5.0% improvement in Recall compared to PyramidDrop. Similarly, on OMG-LLaVA, ALTP improves AP by 2.1% and mIOU by 3.0% at a 90% token reduction compared with PDrop.
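
To make the idea of region-aware, density-based token allocation concrete, the following is a minimal sketch and not the authors' implementation. It assumes each visual token already carries a region label (standing in for a superpixel assignment) and a non-negative importance score (standing in for information density). The function names `allocate_budget` and `prune_tokens`, the use of summed scores as the region weight, and the largest-remainder rounding are all illustrative assumptions.

```python
from collections import defaultdict


def allocate_budget(weights, sizes, total_keep):
    """Split total_keep across regions in proportion to their weights.

    Assumption: a region's quota never exceeds its token count, and
    leftover slots go to regions with the largest fractional remainder.
    """
    total_w = sum(weights)
    if total_w <= 0:
        # Fall back to size-proportional allocation if all weights are zero.
        weights = [float(s) for s in sizes]
        total_w = sum(weights)
    raw = [w / total_w * total_keep for w in weights]
    quota = [min(int(x), s) for x, s in zip(raw, sizes)]
    remaining = total_keep - sum(quota)
    order = sorted(range(len(raw)), key=lambda r: raw[r] - quota[r], reverse=True)
    while remaining > 0:
        progressed = False
        for r in order:
            if remaining == 0:
                break
            if quota[r] < sizes[r]:
                quota[r] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break
    return quota


def prune_tokens(scores, region_ids, keep_ratio=0.1):
    """Return sorted indices of the tokens kept after pruning.

    scores:     hypothetical per-token importance values (non-negative).
    region_ids: hypothetical per-token region labels, e.g. superpixel ids.
    keep_ratio: fraction of tokens to keep (0.1 corresponds to a 90% reduction).
    """
    members = defaultdict(list)
    for i, r in enumerate(region_ids):
        members[r].append(i)
    regions = sorted(members)
    sizes = [len(members[r]) for r in regions]
    # Region weight = summed score, i.e. mean density times region size.
    weights = [sum(scores[i] for i in members[r]) for r in regions]
    total_keep = min(len(scores), max(1, round(keep_ratio * len(scores))))
    quota = allocate_budget(weights, sizes, total_keep)
    kept = []
    for r, q in zip(regions, quota):
        ranked = sorted(members[r], key=lambda i: scores[i], reverse=True)
        kept.extend(ranked[:q])
    return sorted(kept)
```

With `keep_ratio=0.1`, the sketch keeps one token in ten, and regions whose tokens score higher receive a larger share of that budget. How ALTP actually measures density and forms regions is described in the full paper, not here.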

 
