Abstract
We present Neural Feature Fusion Fields (N3F), a method that improves dense 2D image feature extractors when the latter are applied to the analysis of multiple images reconstructible as a 3D scene. Given an image feature extractor, for example pre-trained using self-supervision, N3F uses it as a teacher to learn a student network defined in 3D space. The 3D student network is similar to a neural radiance field that distills said features and can be trained with the usual differentiable rendering machinery. As a consequence, N3F is readily applicable to most neural rendering formulations, including vanilla NeRF and its extensions to complex dynamic scenes. We show that our method not only enables semantic understanding in the context of scene-specific neural fields without the use of manual labels, but also consistently improves over the self-supervised 2D baselines. This is demonstrated by considering various tasks, such as 2D object retrieval, 3D segmentation, and scene editing, in diverse sequences, including long egocentric videos in the EPIC-KITCHENS benchmark.
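
To make the teacher-student idea concrete, the following is a minimal sketch of how per-point features predicted by a 3D student could be rendered along a single camera ray and compared with the 2D teacher feature at the corresponding pixel. It assumes the standard NeRF volume-rendering weights and a squared-error distillation loss; the function names (`render_weights`, `render_features`, `distillation_loss`) are hypothetical, and the loss and architecture actually used by N3F may differ.

```python
import numpy as np

def render_weights(sigma, delta):
    """Standard NeRF-style volume-rendering weights for samples along one ray.

    sigma: (N,) non-negative densities at the N samples
    delta: (N,) distances between consecutive samples
    """
    alpha = 1.0 - np.exp(-sigma * delta)  # per-sample opacity
    # Transmittance: fraction of light reaching sample i unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    return trans * alpha

def render_features(sigma, delta, point_features):
    """Composite per-sample student features (N, D) into one pixel feature (D,)."""
    weights = render_weights(sigma, delta)
    return weights @ point_features

def distillation_loss(rendered, teacher):
    """Assumed squared-error loss between rendered and teacher features."""
    return float(np.sum((rendered - teacher) ** 2))

# Toy ray: three samples, with only the middle one effectively opaque.
sigma = np.array([0.0, 50.0, 0.0])
delta = np.ones(3)
point_features = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
rendered = render_features(sigma, delta, point_features)
loss = distillation_loss(rendered, teacher=np.array([0.0, 1.0]))
```

In this toy ray the rendered feature is dominated by the opaque middle sample, so the loss against a teacher feature equal to that sample's feature is close to zero. Because the rendering weights are the same ones used for colour, in an automatic-differentiation framework the gradient of such a loss would flow to the student through the usual differentiable rendering path.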