Abstract
3D geometric information is essential for manipulation tasks, as robots need to perceive the 3D environment, reason about spatial relationships, and interact with intricate spatial configurations. Recent research has increasingly focused on the explicit extraction of 3D features, while still facing challenges such as the lack of large-scale robotic 3D data and the potential loss of spatial geometry. To address these limitations, we propose the Lift3D framework, which progressively enhances 2D foundation models with implicit and explicit 3D robotic representations to construct a robust 3D manipulation policy. Specifically, we first design a task-aware masked autoencoder that masks task-relevant affordance patches and reconstructs depth information, enhancing the 2D foundation model's implicit 3D robotic representation. After self-supervised fine-tuning, we introduce a 2D model-lifting strategy that establishes a positional mapping between the input 3D points and the positional embeddings of the 2D model. Based on the mapping, Lift3D utilizes the 2D foundation model to directly encode point cloud data, leveraging large-scale pretrained knowledge to construct explicit 3D robotic representations while minimizing spatial information loss. In experiments, Lift3D consistently outperforms previous state-of-the-art methods across several simulation benchmarks and real-world scenarios.
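
As an illustration of the positional mapping between 3D points and 2D positional embeddings, a minimal sketch follows. It is not the implementation described in this work, since the abstract does not specify how the mapping is built. The sketch assumes points normalized to [-1, 1]^3, orthographic projection onto three axis-aligned planes, a square patch grid, and averaging of the looked-up embeddings. The function names `project_to_patch_index` and `lifted_positional_embedding` are hypothetical.

```python
# Illustrative sketch only: every design choice here is an assumption,
# not a detail taken from the paper.

def project_to_patch_index(point, grid, axes):
    """Map a 3D point in [-1, 1]^3 to a flat patch index on a grid x grid plane.

    `axes` selects the two coordinates kept by the orthographic projection.
    """
    u, v = point[axes[0]], point[axes[1]]
    # Rescale [-1, 1] to [0, grid) and clamp to the valid patch range.
    col = max(0, min(int((u + 1.0) / 2.0 * grid), grid - 1))
    row = max(0, min(int((v + 1.0) / 2.0 * grid), grid - 1))
    return row * grid + col


def lifted_positional_embedding(point, pos_embed, grid):
    """Average the 2D positional embeddings hit by a point on three planes.

    `pos_embed` is a list of grid * grid embedding vectors, standing in for
    the pretrained positional embeddings of a 2D model.
    """
    planes = [(0, 1), (1, 2), (0, 2)]  # assumed projection planes: xy, yz, xz
    dim = len(pos_embed[0])
    acc = [0.0] * dim
    for axes in planes:
        idx = project_to_patch_index(point, grid, axes)
        for d in range(dim):
            acc[d] += pos_embed[idx][d]
    return [a / len(planes) for a in acc]


# Toy 4 x 4 grid of 2-dimensional embeddings.
toy_pos_embed = [[float(i), float(2 * i)] for i in range(16)]
corner_embedding = lifted_positional_embedding((1.0, 1.0, 1.0), toy_pos_embed, 4)
```

Under these assumptions, each 3D point receives an embedding built from positions the 2D model already knows, so a point token can be passed to the pretrained encoder without learning a new positional scheme from scratch.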