NeCo: Improving DINOv2’s spatial representations in 16 GPU hours with Patch Neighbor Consistency

Abstract

We propose sorting patch representations across views as a novel self-supervised learning signal for improving pretrained representations. To this end, we introduce NeCo: Patch Neighbor Consistency, a novel training loss that enforces patch-level nearest-neighbor consistency between a student and a teacher model, relative to reference batches. Our method applies differentiable sorting on top of pretrained representations, such as DINOv2-registers, to bootstrap the learning signal and further improve upon them. This dense post-training leads to superior performance across various models and datasets, despite requiring only 16 hours on a single GPU. We demonstrate that our method generates high-quality dense feature encoders and establishes several new state-of-the-art results: +5.5% and +6% for non-parametric in-context semantic segmentation on ADE20k and Pascal VOC, and +7.2% and +5.7% for linear segmentation evaluations on COCO-Things and COCO-Stuff.

Given an input image \(I\), two augmentations \(\tau_1\) and \(\tau_2\) are applied to create two different views, which are processed by the teacher and student encoders, \(\phi_t\) and \(\phi_s\) respectively. The teacher encoder is updated as an exponential moving average (EMA) of the student. The encoded patch features are spatially aligned using RoI Align and compared against reference features \(F_r\), obtained by applying \(\phi_s\) to other images in the batch. Next, pairwise similarities \(D_{ij}\) between \(F_s\) and \(F_r\), as well as between \(F_t\) and \(F_r\), are computed using cosine similarity. These similarities are then ordered with a differentiable sorting operator, and the NeCo loss enforces consistency between the resulting nearest-neighbor orderings, as sketched below.
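To make this pipeline concrete, here is a minimal PyTorch sketch of one plausible instantiation. It is not the authors' implementation: a NeuralSort-style relaxation (Grover et al., 2019) stands in for whichever differentiable sorting operator NeCo actually uses, the exact form of the consistency term is our assumption, and all names (`soft_sort_perm`, `neco_loss`, `ema_update`) are illustrative. The RoI Align step is omitted; `f_s` and `f_t` are assumed to be already spatially aligned patch features.

```python
import torch
import torch.nn.functional as F


def soft_sort_perm(s: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NeuralSort-style relaxation: a soft permutation matrix (..., n, n)
    whose i-th row softly selects the i-th largest entry of s (..., n)."""
    n = s.shape[-1]
    # A[..., j] = sum_k |s_j - s_k|
    A = (s.unsqueeze(-1) - s.unsqueeze(-2)).abs().sum(dim=-1)
    coeff = n + 1 - 2 * torch.arange(1, n + 1, device=s.device, dtype=s.dtype)
    logits = coeff.view(-1, 1) * s.unsqueeze(-2) - A.unsqueeze(-2)  # (..., n, n)
    return torch.softmax(logits / tau, dim=-1)


def neco_loss(f_s: torch.Tensor, f_t: torch.Tensor, f_ref: torch.Tensor,
              tau: float = 0.1) -> torch.Tensor:
    """Hypothetical neighbor-order consistency loss.

    f_s, f_t: (B, N, D) view-aligned student/teacher patch features.
    f_ref:    (R, D) reference patch features from other images in the batch.
    """
    f_s, f_t, f_ref = (F.normalize(x, dim=-1) for x in (f_s, f_t, f_ref))
    sim_s = f_s @ f_ref.T               # (B, N, R) cosine similarities
    sim_t = (f_t @ f_ref.T).detach()    # teacher provides targets, no gradient
    P_t = soft_sort_perm(sim_t, tau)    # teacher's soft neighbor ordering
    P_s = soft_sort_perm(sim_s, tau)    # student's soft neighbor ordering
    # If student and teacher agree on the nearest-neighbor order, reordering
    # the student's similarities by either soft permutation gives the same vector.
    by_t = (P_t @ sim_s.unsqueeze(-1)).squeeze(-1)
    by_s = (P_s @ sim_s.unsqueeze(-1)).squeeze(-1)
    return F.mse_loss(by_t, by_s)


@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               momentum: float = 0.999) -> None:
    """After each optimizer step, the teacher tracks the student via EMA."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)
```

One detail worth noting in this sketch: detaching the teacher's similarities keeps the ordering targets fixed, while gradients still flow to the student both through its similarities and through its soft permutation, which is exactly what a differentiable sorting operator buys over a hard sort.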

Want to learn more about NeCo?

Check out our paper and code!