DINOv2 patch-level representations predicted from RGB images (left) are aggregated into highly detailed 3D representations (right).

Abstract

We address the task of uplifting visual features or semantic masks from 2D vision models to 3D scenes represented by Gaussian splatting. Whereas common approaches rely on iterative optimization-based procedures, we show that a simple yet effective aggregation technique yields excellent results. Applied to semantic masks from Segment Anything (SAM), our uplifting approach leads to segmentation quality comparable to the state of the art. We then extend this method to generic DINOv2 features, integrating 3D scene geometry through graph diffusion, and achieve competitive segmentation results despite DINOv2 not being trained on millions of annotated masks like SAM.

Optimization-free uplifting


Illustration of the inverse and forward rendering of 2D visual features produced by DINOv2.


We propose a simple, parameter-free aggregation mechanism to uplift visual features from models such as DINOv2, CLIP and SAM into 3D Gaussian Splatting scenes. Each Gaussian \( i \) in the scene is assigned a feature \( f_i \) defined as a weighted sum of the 2D features \( F \) over the set of pixels \( \mathcal{S}_i \) affected by Gaussian \( i \) in the forward rendering process:

\[ f_i = \underbrace{\sum_{(d,p)\in\mathcal{S}_i}}_{\scriptsize \begin{array}{c} \text{Summation over all} \\ \text{directions $d$ and pixels $p$} \end{array}} \underbrace{\frac{w_i(d, p)}{\sum_{(d,p)\in \mathcal{S}_i} w_i(d, p)}}_{\scriptsize \begin{array}{c} \text{Normalized weight of Gaussian $i$} \\ \text{at $(d,p)$ resulting from $\alpha$-blending} \\ \text{of Gaussians crossed along ray} \end{array}} \times \underbrace{F_{d,p}}_{\scriptsize \begin{array}{c} \text{Feature at pixel $p$} \\ \text{in direction $d$} \end{array}} \]
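Below is a minimal PyTorch-style sketch of this aggregation, assuming the rasterizer exposes its α-blending weights as a flat list of (Gaussian, ray) contributions; the function name and tensor layout are illustrative assumptions, not the actual implementation.

```python
import torch

def uplift_features(weights, pixel_ids, gaussian_ids, features_2d, num_gaussians):
    """Aggregate 2D features into per-Gaussian 3D features by weighted averaging.

    weights:       (M,) alpha-blending weights w_i(d, p), one per (Gaussian, ray) pair
    pixel_ids:     (M,) flat index of the pixel (d, p) for each pair
    gaussian_ids:  (M,) index i of the Gaussian for each pair
    features_2d:   (P, C) 2D feature maps flattened over all rendered views
    num_gaussians: total number of Gaussians in the scene
    """
    C = features_2d.shape[1]
    f = torch.zeros(num_gaussians, C)
    w_sum = torch.zeros(num_gaussians)

    # Numerator: sum of w_i(d, p) * F_{d,p} over all pixels touched by Gaussian i
    f.index_add_(0, gaussian_ids, weights[:, None] * features_2d[pixel_ids])
    # Denominator: sum of w_i(d, p), used to normalize the blending weights
    w_sum.index_add_(0, gaussian_ids, weights)

    return f / w_sum.clamp(min=1e-8)[:, None]
```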

Diffusion on graphs

We define a graph where each Gaussian \( i \) is connected to its \( k \) nearest Euclidean neighbors \( \mathcal{N}(i) \), with edge weights based on the similarity between their uplifted DINOv2 features. Given initial Gaussian features \( g_0 \) (e.g., a rough segmentation mask), the diffusion process runs as:
\[ g_{t+1} = Ag_t, \quad A_{ij} = \underbrace{\mathbf{1}_{j \in \mathcal{N}(i)}}_{\scriptsize \begin{array}{c} \text{$1$ if $j$ is in the} \\ \text{neighborhood of Gaussian $i$} \end{array}} \times \underbrace{S_f(f_i, f_j)}_{\scriptsize \begin{array}{c} \text{Similarity between} \\ \text{$f_i$ and $f_j$} \end{array}} \times \underbrace{P(f_i)^{\frac 12}P(f_j)^{\frac 12}}_{\scriptsize \begin{array}{c} \text{Similarity between Gaussian} \\ \text{$i$ and reference Gaussians.} \\ \text{Prevents leakage into background.} \end{array}} \]
2D projection of the weight vector \( g_t \) during diffusion for foreground/background segmentation. \( g_0 \) is initialized from 2D scribbles of the object of interest on a reference view, which we uplift to 3D. The 3D scribbles then spread to neighboring Gaussians with similar DINOv2 features.
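A sketch of this diffusion under the definitions above; the exponential form of the similarity \( S_f \), the temperature, and the number of steps are assumptions for illustration, not the paper's exact choices.

```python
import torch

def diffusion_step(g, f, knn_idx, p_ref, tau=0.1):
    """One step g <- A g of graph diffusion over the Gaussians.

    g:       (N,) current weights (e.g. a rough foreground mask)
    f:       (N, C) uplifted DINOv2 features, assumed L2-normalized
    knn_idx: (N, k) indices of the k nearest Euclidean neighbors of each Gaussian
    p_ref:   (N,) similarity P(f_i) to the reference (scribbled) Gaussians,
             which prevents leakage into the background
    """
    # Similarity S_f(f_i, f_j) between each Gaussian and its neighbors
    sim = torch.einsum('nc,nkc->nk', f, f[knn_idx])
    sim = torch.exp(sim / tau)  # assumed exponential form of S_f

    # Edge weights A_ij = 1_{j in N(i)} * S_f(f_i, f_j) * P(f_i)^1/2 P(f_j)^1/2
    A = sim * p_ref[:, None].sqrt() * p_ref[knn_idx].sqrt()

    # Matrix-vector product g <- A g restricted to the k-NN neighborhoods
    return (A * g[knn_idx]).sum(dim=1)

def diffuse(g0, f, knn_idx, p_ref, num_steps=50):
    """Run the diffusion from uplifted 2D scribbles g0."""
    g = g0
    for _ in range(num_steps):
        g = diffusion_step(g, f, knn_idx, p_ref)
    return g
```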

LERF Localization

We uplift CLIP and DINOv2 features in 3D and use graph diffusion based on DINOv2 feature similarities to refine CLIP relevancy scores.

3D DINOv2 features (left), CLIP features (middle) and CLIP relevancy with text prompts (right).
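A possible sketch of this refinement, combining a cosine-similarity relevancy score with the same k-NN graph diffusion; the row normalization, clamping, and step count are illustrative choices rather than the exact procedure.

```python
import torch

def refine_relevancy(clip_f, text_emb, f_dino, knn_idx, num_steps=30):
    """Compute CLIP relevancy per Gaussian and refine it by DINOv2-graph diffusion.

    clip_f:   (N, C) uplifted CLIP features per Gaussian
    text_emb: (C,) CLIP embedding of the text prompt
    f_dino:   (N, D) uplifted DINOv2 features, L2-normalized
    knn_idx:  (N, k) k nearest Euclidean neighbors of each Gaussian
    """
    # Initial relevancy: cosine similarity between Gaussian CLIP features and the prompt
    g = torch.cosine_similarity(clip_f, text_emb[None, :], dim=1).clamp(min=0)

    # Edge weights from DINOv2 feature similarity over the k-NN graph
    sim = torch.einsum('nd,nkd->nk', f_dino, f_dino[knn_idx]).clamp(min=0)
    sim = sim / sim.sum(dim=1, keepdim=True).clamp(min=1e-8)  # row-normalize

    # Diffuse relevancy along the graph
    for _ in range(num_steps):
        g = (sim * g[knn_idx]).sum(dim=1)
    return g
```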

Citing LUDVIG

Acknowledgements

The website template was borrowed from Ref-NeRF.