LUDVIG: Learning-free Uplifting of 2D Visual features to Gaussian Splatting scenes

arXiv preprint, 2024

Juliette Marrie, Romain Menegaux, Michael Arbel, Diane Larlus, Julien Mairal

We address the problem of extending the capabilities of vision foundation models, such as DINO, SAM, and CLIP, to 3D tasks. Specifically, we introduce a novel method to uplift 2D image features into 3D Gaussian Splatting scenes. Unlike traditional approaches that rely on minimizing a reconstruction loss, our method employs a simpler and more efficient feature aggregation technique, augmented by a graph diffusion mechanism. Graph diffusion enriches features from a given model, such as CLIP, by leveraging 3D geometry and pairwise similarities induced by another strong model, such as DINOv2. Our approach achieves performance comparable to the state of the art on multiple downstream tasks while delivering significant speed-ups. Notably, we obtain competitive segmentation results using generic DINOv2 features, even though DINOv2, unlike SAM, was not trained on millions of annotated segmentation masks. When applied to CLIP features, our method demonstrates strong performance on open-vocabulary object detection tasks, highlighting its versatility.
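To make the aggregation concrete, here is a minimal sketch of the learning-free uplifting step. It assumes the splatting rasterizer exposes, for each view, the alpha-blending weights tying Gaussians to pixels; the function name and tensor layout are our own illustration, not the paper's code.

```python
import torch

def uplift_features(weights, feats_2d, eps=1e-8):
    """Learning-free uplifting of 2D features onto 3D Gaussians (sketch).

    Each Gaussian's 3D feature is a weighted average of the 2D features of
    the pixels it contributes to, accumulated over all training views.

    weights:  list over views of (G, P) tensors of alpha-blending weights
              from the rasterizer (dense here for clarity; sparse in practice).
    feats_2d: list over views of (P, D) tensors of per-pixel 2D features
              (e.g. DINOv2 or CLIP features upsampled to pixel resolution).
    Returns a (G, D) tensor of per-Gaussian 3D features.
    """
    G, D = weights[0].shape[0], feats_2d[0].shape[1]
    num = torch.zeros(G, D)
    den = torch.zeros(G, 1)
    for w, f in zip(weights, feats_2d):
        num += w @ f                       # weight-accumulate 2D features
        den += w.sum(dim=1, keepdim=True)  # total weight seen per Gaussian
    return num / (den + eps)               # normalized weighted average
```

Forward rendering then treats these per-Gaussian features as extra channels and splats them onto any viewpoint, exactly like colors in regular Gaussian Splatting.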

Media content
Illustration of the inverse and forward rendering between 2D visual features (produced by DINOv2) and a 3D Gaussian Splatting scene. In the inverse rendering (or uplifting) phase, features are created for each 3D Gaussian by aggregating coarse 2D features over all viewing directions. For forward rendering, the 3D features are projected onto any given viewing direction, as in regular Gaussian Splatting.
3D graph diffusion for foreground/background segmentation. The 3D mask spreads to neighboring Gaussians with similar DINOv2 features; a code sketch follows below.
3D DINOv2 features (left), CLIP features (middle), and CLIP relevancy with text prompts (right).
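The graph diffusion illustrated above can be sketched as follows. This is an illustrative sketch under our own assumptions (a k-NN graph over Gaussian centers, cosine-similarity edge weights from DINOv2 features, and a simple averaging update with a source term), not the paper's exact update rule.

```python
import torch
import torch.nn.functional as F

def diffuse_3d(values, xyz, dino_feats, k=16, steps=50, alpha=0.9):
    """Graph diffusion over 3D Gaussians (illustrative sketch).

    Spreads per-Gaussian values (e.g. a coarse foreground mask or CLIP
    features) along a k-NN graph built from Gaussian centers, with edge
    weights given by DINOv2 feature similarity.

    values:     (G, D) per-Gaussian values to diffuse.
    xyz:        (G, 3) Gaussian centers.
    dino_feats: (G, C) per-Gaussian DINOv2 features (already uplifted).
    """
    # k nearest neighbors in 3D space (O(G^2) here; use a KD-tree at scale)
    dists = torch.cdist(xyz, xyz)
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]   # drop self

    # Edge weights: cosine similarity of DINOv2 features, clamped to >= 0,
    # then row-normalized so each Gaussian's neighborhood sums to one.
    f = F.normalize(dino_feats, dim=1)
    sim = (f.unsqueeze(1) * f[knn]).sum(-1).clamp(min=0)    # (G, k)
    sim = sim / (sim.sum(dim=1, keepdim=True) + 1e-8)

    out = values.clone()
    for _ in range(steps):
        neighbor_avg = (sim.unsqueeze(-1) * out[knn]).sum(dim=1)
        out = alpha * neighbor_avg + (1 - alpha) * values   # keep source term
    return out
```

For foreground/background segmentation, `values` would be a single-channel coarse mask uplifted from a handful of 2D masks; diffusion then propagates it to geometrically close Gaussians whose DINOv2 features agree, as in the animation above.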

Read the full paper