We propose sorting patch representations across views as a novel self-supervised learning signal to improve pretrained representations. To this end, we introduce NeCo: Patch Neighbor Consistency, a novel training loss that enforces patch-level nearest neighbor consistency between a student and a teacher model, relative to reference batches. Our method leverages differentiable sorting applied on top of pretrained representations, such as DINOv2 with registers, to bootstrap the learning signal and further improve upon them. This dense post-training leads to superior performance across various models and datasets, despite requiring only 16 hours on a single GPU. We demonstrate that this method generates high-quality dense feature encoders and establish several new state-of-the-art results: +5.5% and +6% for non-parametric in-context semantic segmentation on ADE20k and Pascal VOC, and +7.2% and +5.7% for linear segmentation evaluations on COCO-Things and -Stuff.
Given an input image \(I\), two augmentations \(\tau_1\) and \(\tau_2\) are applied to create two different views, which are processed by the teacher and student encoders, \(\phi_t\) and \(\phi_s\) respectively. The teacher encoder is updated using an Exponential Moving Average (EMA) of the student. The encoded patch features \(F_t\) and \(F_s\) are then aligned using RoI Align and compared with reference features \(F_r\), obtained by applying \(\phi_s\) to other images in the batch. Next, pairwise distances \(D_{ij}\) between \(F_s\) and \(F_r\), as well as between \(F_t\) and \(F_r\), are computed using cosine similarity. These distances are then sorted with a differentiable sorting operator and used to enforce nearest-neighbor ordering consistency through the NeCo loss.
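The following is a minimal sketch of this pipeline in PyTorch, assuming a SoftSort-style relaxation for the differentiable sorting step and a cross-entropy between the teacher's and student's soft permutations as the consistency objective; the function names (`pairwise_cosine_distance`, `soft_sort`, `neco_loss`) and the exact form of the loss are illustrative assumptions, not the paper's reference implementation, and the RoI Align and EMA steps are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def pairwise_cosine_distance(patches, reference):
    # patches: (B, P, D) patch features, reference: (M, D) reference features
    patches = F.normalize(patches, dim=-1)
    reference = F.normalize(reference, dim=-1)
    return 1.0 - torch.einsum("bpd,md->bpm", patches, reference)

def soft_sort(d, tau=0.1):
    # SoftSort-style relaxation (an assumption; any differentiable sorting
    # operator could be substituted): returns a soft permutation matrix of
    # shape (B, P, M, M) that approximately sorts the last axis ascending.
    d_sorted = d.sort(dim=-1).values                              # (B, P, M)
    diffs = (d_sorted.unsqueeze(-1) - d.unsqueeze(-2)).abs()      # (B, P, M, M)
    return torch.softmax(-diffs / tau, dim=-1)

def neco_loss(student_feats, teacher_feats, reference_feats, tau=0.1):
    # Hypothetical nearest-neighbor ordering consistency loss: the student's
    # soft ranking of reference patches is pushed toward the teacher's.
    d_s = pairwise_cosine_distance(student_feats, reference_feats)
    d_t = pairwise_cosine_distance(teacher_feats, reference_feats)
    p_s = soft_sort(d_s, tau)
    with torch.no_grad():
        p_t = soft_sort(d_t, tau)     # teacher provides the target ordering
    return -(p_t * torch.log(p_s.clamp_min(1e-8))).sum(-1).mean()

# Example usage with random tensors standing in for encoder outputs.
student_feats = torch.randn(4, 196, 384, requires_grad=True)   # \phi_s view
teacher_feats = torch.randn(4, 196, 384)                       # \phi_t view
reference_feats = torch.randn(512, 384)                        # F_r from other images
loss = neco_loss(student_feats, teacher_feats, reference_feats)
loss.backward()
```

Keeping the teacher's soft permutation inside `torch.no_grad()` mirrors the teacher-student asymmetry described above: gradients flow only through the student branch, while the teacher is updated via EMA.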