Abstract
3D affordance reasoning, the task of associating human instructions with the functional regions of 3D objects, is a critical capability for embodied agents. Current methods based on 3D Gaussian Splatting (3DGS) are fundamentally limited to single-object, single-step interactions, a paradigm that falls short of addressing the long-horizon, multi-object tasks required for complex real-world applications. To bridge this gap, we introduce the novel task of Sequential 3D Gaussian Affordance Reasoning and establish SeqAffordSplat, a large-scale benchmark featuring 1800+ scenes to support research on long-horizon affordance understanding in complex 3DGS environments. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. SeqSplatNet employs a large language model that autoregressively generates text interleaved with special segmentation tokens, guiding a conditional decoder to produce the corresponding 3D mask. To handle complex scene geometry, we introduce a pre-training strategy, Conditional Geometric Reconstruction, where the model learns to reconstruct complete affordance region masks from known geometric observations, thereby building a robust geometric prior. Furthermore, to resolve semantic ambiguities, we design a feature injection mechanism that lifts rich semantic features from 2D Vision Foundation Models (VFM) and fuses them into the 3D decoder at multiple scales. Extensive experiments demonstrate that our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.