We propose sorting patch representations across views as a novel self-supervised learning signal to improve pretrained representations. To this end, we introduce NeCo: Patch Neighbor Consistency, a novel training loss that enforces patch-level nearest neighbor consistency between a student and a teacher model, relative to reference batches. Our method leverages differentiable sorting applied on top of pretrained representations, such as DINOv2 with registers, to bootstrap the learning signal and further improve upon them. This dense post-training leads to superior performance across various models and datasets, despite requiring only 16 hours on a single GPU. We demonstrate that this method generates high-quality dense feature encoders and establish several new state-of-the-art results: +5.5% and +6% for non-parametric in-context semantic segmentation on ADE20k and Pascal VOC, and +7.2% and +5.7% for linear segmentation evaluations on COCO-Things and -Stuff.
Given an input image \(I\), two augmentations \(\tau_1\) and \(\tau_2\) are applied to create two different views, which are processed by the teacher and student encoders, \(\phi_t\) and \(\phi_s\) respectively. The teacher encoder is updated using an Exponential Moving Average (EMA) of the student. The encoded patch features \(F_t\) and \(F_s\) are then aligned using RoI Align and compared with reference features \(F_r\), obtained by applying \(\phi_s\) to other images in the batch. Next, pairwise distances \(D_{ij}\) between \(F_s\) and \(F_r\), as well as between \(F_t\) and \(F_r\), are computed using cosine similarity. These distances are then sorted with a differentiable sorting operator and used to enforce nearest-neighbor ordering consistency through the NeCo loss.
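The following is a minimal sketch of this pipeline in PyTorch, assuming a SoftSort-style relaxation for the differentiable sorting step and a cross-entropy between the teacher's and student's soft permutations as the consistency objective; the function names (`pairwise_cosine_distance`, `soft_sort`, `neco_loss`) and the exact form of the loss are illustrative assumptions, not the paper's reference implementation, and the RoI Align and EMA steps are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def pairwise_cosine_distance(patches, reference):
    # patches: (B, P, D) patch features, reference: (M, D) reference features
    patches = F.normalize(patches, dim=-1)
    reference = F.normalize(reference, dim=-1)
    return 1.0 - torch.einsum("bpd,md->bpm", patches, reference)

def soft_sort(d, tau=0.1):
    # SoftSort-style relaxation (an assumption; any differentiable sorting
    # operator could be substituted): returns a soft permutation matrix of
    # shape (B, P, M, M) that approximately sorts the last axis ascending.
    d_sorted = d.sort(dim=-1).values                              # (B, P, M)
    diffs = (d_sorted.unsqueeze(-1) - d.unsqueeze(-2)).abs()      # (B, P, M, M)
    return torch.softmax(-diffs / tau, dim=-1)

def neco_loss(student_feats, teacher_feats, reference_feats, tau=0.1):
    # Hypothetical nearest-neighbor ordering consistency loss: the student's
    # soft ranking of reference patches is pushed toward the teacher's.
    d_s = pairwise_cosine_distance(student_feats, reference_feats)
    d_t = pairwise_cosine_distance(teacher_feats, reference_feats)
    p_s = soft_sort(d_s, tau)
    with torch.no_grad():
        p_t = soft_sort(d_t, tau)     # teacher provides the target ordering
    return -(p_t * torch.log(p_s.clamp_min(1e-8))).sum(-1).mean()

# Example usage with random tensors standing in for encoder outputs.
student_feats = torch.randn(4, 196, 384, requires_grad=True)   # \phi_s view
teacher_feats = torch.randn(4, 196, 384)                       # \phi_t view
reference_feats = torch.randn(512, 384)                        # F_r from other images
loss = neco_loss(student_feats, teacher_feats, reference_feats)
loss.backward()
```

Keeping the teacher's soft permutation inside `torch.no_grad()` mirrors the teacher-student asymmetry described above: gradients flow only through the student branch, while the teacher is updated via EMA.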