Abstract
DETR-based methods, which use multi-layer transformer decoders to refine object queries iteratively, have shown promising performance in 3D indoor object detection. However, the scene point features in the transformer decoder remain fixed, so later decoder layers contribute little, which limits performance improvement. Recently, State Space Models (SSMs) have shown efficient context modeling ability with linear complexity through iterative interactions between system states and inputs. Inspired by SSMs, we propose a new 3D object DEtection paradigm with an interactive STate space model (DEST). In the interactive SSM, we design a novel state-dependent SSM parameterization method that enables system states to effectively serve as queries in 3D indoor detection tasks. In addition, we introduce four key designs tailored to the characteristics of point clouds and SSMs: the serialization and bidirectional scanning strategies enable bidirectional feature interaction among scene points within the SSM; the inter-state attention mechanism models the relationships between state points; and the gated feed-forward network enhances inter-channel correlations. To the best of our knowledge, this is the first method to model queries as system states and scene points as system inputs, which allows it to update scene point features and query features simultaneously with linear complexity. Extensive experiments on two challenging datasets demonstrate the effectiveness of our DEST-based method. Our method improves the GroupFree baseline in terms of AP50 on the ScanNetV2 (+5.3) and SUN RGB-D (+3.2) datasets. Based on the VDETR baseline, our method sets a new state of the art on the ScanNetV2 and SUN RGB-D datasets.