Juliette Marrie1,2 Romain Menegaux1 Michael Arbel1 Diane Larlus2 Julien Mairal1
1Inria 2NAVER LABS Europe
We address the task of uplifting visual features or semantic masks from 2D vision models to 3D scenes represented by Gaussian splatting. Whereas common approaches rely on iterative optimization-based procedures, we show that a simple yet effective aggregation technique yields excellent results. Applied to semantic masks from Segment Anything (SAM), our uplifting approach leads to segmentation quality comparable to the state of the art. We then extend this method to generic DINOv2 features, integrating 3D scene geometry through graph diffusion, and achieve competitive segmentation results even though DINOv2, unlike SAM, was not trained on millions of annotated masks.
We propose a simple, parameter-free aggregation mechanism to uplift visual features from models such as DINOv2, CLIP and SAM into 3D Gaussian Splatting scenes. Each Gaussian \( i \) in the scene is assigned a feature \( f_i \) defined as a weighted average of the 2D features \( F \) over the set of pixels \( \mathcal{S}_i \) affected by Gaussian \( i \) in the forward rendering process:
\[ f_i \;=\; \frac{\sum_{p \in \mathcal{S}_i} w_i(p)\, F(p)}{\sum_{p \in \mathcal{S}_i} w_i(p)}, \]
where \( w_i(p) \) is the alpha-blending weight with which Gaussian \( i \) contributes to pixel \( p \) during rendering.
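A minimal sketch of this aggregation in NumPy, assuming the rasterizer exposes, for each pixel-Gaussian contribution, the Gaussian index and its alpha-blending weight as flat arrays (the function and argument names are illustrative, not the paper's implementation):

```python
import numpy as np

def uplift_features(pixel_feats, gauss_ids, weights, num_gaussians):
    """Aggregate 2D pixel features into per-Gaussian 3D features.

    pixel_feats:   (P, D) array; 2D feature F(p) for each of P contributions
    gauss_ids:     (P,) index of the Gaussian contributing at each entry
    weights:       (P,) rendering (alpha-blending) weight w_i(p) of that Gaussian
    num_gaussians: total number of Gaussians in the scene
    """
    D = pixel_feats.shape[1]
    feat_sum = np.zeros((num_gaussians, D))
    weight_sum = np.zeros(num_gaussians)
    # Scatter-add weighted features and weights per Gaussian.
    np.add.at(feat_sum, gauss_ids, weights[:, None] * pixel_feats)
    np.add.at(weight_sum, gauss_ids, weights)
    # Normalize; Gaussians with no contribution keep a zero feature.
    return feat_sum / np.maximum(weight_sum, 1e-8)[:, None]
```

Because the operation is a single scatter-add followed by a normalization, it needs no learned parameters and no iterative optimization, which is what makes the uplifting step cheap compared to optimization-based alternatives.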
The website template was borrowed from Ref-NeRF.