Representing images with region tokens from REN (instead of conventional patch-based tokens) improves performance across multiple tasks while keeping the representation compact and content-aware.
We introduce the Region Encoder Network (REN), a fast and effective model for generating region-based image representations using point prompts. Recent methods combine class-agnostic segmenters (e.g., SAM) with patch-based image encoders (e.g., DINO) to produce compact and effective region representations, but they suffer from high computational cost due to the segmentation step. REN bypasses this bottleneck using a lightweight module that directly generates region tokens, enabling 60x faster token generation with 35x less memory, while also improving token quality. It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values to produce region tokens that correspond to the prompted objects. We train REN with three popular encoders—DINO, DINOv2, and OpenCLIP—and show that it can be extended to other encoders without dedicated training. We evaluate REN on semantic segmentation and retrieval tasks, where it consistently outperforms the original encoders in both performance and compactness, and matches or exceeds SAM-based region methods while being significantly faster. Notably, REN achieves state-of-the-art results on the challenging Ego4D VQ2D benchmark and outperforms proprietary LMMs on Visual Haystacks' single-needle challenge.
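To make the mechanism concrete, here is a minimal PyTorch sketch of a point-prompted cross-attention module in the spirit described above. This is an illustrative assumption, not the released REN implementation: the class name, number of blocks, head count, and the way point prompts are embedded are all made up for clarity.

```python
import torch
import torch.nn as nn

class RegionTokenGenerator(nn.Module):
    """Illustrative point-prompted cross-attention module (not the released REN code)."""

    def __init__(self, dim=768, num_heads=8, num_blocks=3):
        super().__init__()
        # Assumed design: embed normalized (x, y) point prompts as query tokens.
        self.point_embed = nn.Linear(2, dim)
        self.attn_blocks = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_blocks)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_blocks)])
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, points, patch_feats):
        # points:      (B, P, 2) point prompts in normalized image coordinates
        # patch_feats: (B, N, dim) patch tokens from a frozen encoder (e.g., DINOv2)
        queries = self.point_embed(points)
        for attn, norm in zip(self.attn_blocks, self.norms):
            # Point-prompt queries attend over patch keys/values, so each output
            # token aggregates features from the region around its prompt.
            attended, _ = attn(queries, patch_feats, patch_feats)
            queries = norm(queries + attended)
        return queries + self.mlp(queries)  # (B, P, dim) region tokens
```

With a ViT-B backbone, `dim` would be 768 and `patch_feats` would be the encoder's patch tokens; the actual model may add positional information, per-block MLPs, or a different prompt embedding.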
Point prompts interact with patch-based features through cross-attention blocks to produce region tokens. The training objective combines two components: (1) a contrastive loss that aligns region tokens with those generated from an augmented view of the same image, and (2) a feature-similarity loss that aligns a linear projection of these tokens with average-pooled patch features obtained using SAM masks. REN eliminates the need for explicit segmentation at inference time while producing efficient and semantically rich region representations. We also show thresholded attention maps for three query points inside the cross-attention block, which illustrate that the model learns to aggregate features primarily from the regions marked by the corresponding point prompts.
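The sketch below illustrates the two training terms, under assumptions that are ours rather than the paper's: an InfoNCE-style formulation for the contrastive term, SAM masks rasterized onto the patch grid for the feature-similarity term, and arbitrary temperature and projection choices.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(tokens_a, tokens_b, temperature=0.07):
    # tokens_a, tokens_b: (P, D) region tokens for the same P point prompts,
    # computed from two augmented views of one image. Matching indices are
    # positives; all other pairs serve as negatives.
    a = F.normalize(tokens_a, dim=-1)
    b = F.normalize(tokens_b, dim=-1)
    logits = a @ b.t() / temperature                  # (P, P) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def feature_similarity_loss(tokens, patch_feats, sam_masks, proj):
    # tokens:      (P, D) region tokens, one per point prompt
    # patch_feats: (N, C) patch features from the frozen encoder
    # sam_masks:   (P, N) binary SAM masks over the patch grid, one per prompt
    # proj:        nn.Linear(D, C) projecting region tokens into patch-feature space
    pooled = (sam_masks.float() @ patch_feats) / sam_masks.sum(-1, keepdim=True).clamp(min=1)
    return (1 - F.cosine_similarity(proj(tokens), pooled, dim=-1)).mean()
```

Note that SAM is only needed here, during training, to supply the mask-pooled targets; inference uses the point prompts alone.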
REN effectively localizes visual queries in long videos despite challenges like clutter, occlusions, background blending, motion blur, viewpoint changes, and brief visibility. On the Ego4D VQ2D benchmark, REN outperforms all existing approaches, including those developed specifically for this benchmark.
| Method | stAP | tAP | Success (%) | Recovery (%) |
|---|---|---|---|---|
| SiamRCNN | 0.13 | 0.21 | 41.6 | 34.0 |
| CocoFormer | 0.18 | 0.26 | 48.1 | 43.2 |
| VQLoC | 0.24 | 0.32 | 55.9 | 45.1 |
| HERO-VQL | 0.28 | 0.37 | 60.7 | 45.3 |
| PRVQL | 0.28 | 0.37 | 59.4 | 45.7 |
| RELOCATE | 0.35 | 0.43 | 60.1 | 50.6 |
| REN | 0.40 | 0.52 | 61.2 | 49.3 |
REN improves semantic segmentation performance across different image encoders. Its region tokens yield noticeably cleaner predictions than the patch-based features of DINOv2.
| Method | VOC2012 (mIoU) | ADE20K (mIoU) |
|---|---|---|
| DINOv2 | 82.1 | 47.7 |
| REN-DINOv2 | 86.5 | 50.9 |
| DINO | 66.4 | 31.8 |
| REN-DINO | 71.4 | 35.1 |
| OpenCLIP | 71.4 | 39.3 |
| REN-OpenCLIP | 78.0 | 42.8 |
On the Visual Haystacks single-needle challenge, REN outperforms proprietary LMMs, open-source LMMs, and RAG-based methods, especially for larger numbers of input images (denoted by N). "E" indicates context overflow, execution failure, or API error.
| Method | N=1 | N=2 | N=3 | N=5 | N=10 | N=20 | N=50 | N=100 | N=500 | N=1K |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 1.5 Pro | 88.4 | 82.0 | 78.3 | 76.0 | 71.9 | 68.6 | 62.8 | 57.4 | E | E |
| GPT-4o | 82.5 | 79.9 | 77.5 | 73.3 | 68.2 | 65.4 | 59.7 | 55.3 | E | E |
| LongVILA | 63.8 | 59.0 | 57.7 | 56.7 | 55.6 | 52.0 | 52.0 | 52.0 | E | E |
| Qwen2-VL | 80.9 | 76.6 | 73.6 | 67.9 | 62.6 | 59.1 | 52.6 | E | E | E |
| Phi-3 | 80.5 | 69.1 | 67.3 | 62.0 | 54.8 | 52.6 | 50.8 | E | E | E |
| InternVL2 | 88.1 | 80.5 | 72.3 | 63.9 | 58.8 | 55.2 | E | E | E | E |
| mPLUG-OWL3 | 84.4 | 66.0 | 62.1 | 57.0 | 53.2 | 51.5 | E | E | E | E |
| LLaVA-v1.5 | 85.8 | 77.1 | 75.8 | 68.6 | 63.6 | 60.4 | 55.3 | 57.5 | 55.4 | 52.9 |
| MIRAGE | 83.2 | 77.8 | 76.6 | 72.8 | 70.5 | 66.0 | 63.6 | 62.0 | 58.7 | 55.7 |
| SigLIP 2 | 72.0 | 69.2 | 68.1 | 65.3 | 64.1 | 60.3 | 58.7 | 58.3 | 56.6 | 54.9 |
| REN | 81.2 | 78.6 | 77.4 | 76.0 | 74.0 | 72.1 | 68.3 | 65.5 | 62.3 | 59.2 |
Region-based methods outperform the patch-based baseline, and REN further surpasses the SAM-based baseline while generating region tokens faster and with less memory.
| Method | mAP | mRP@50 |
|---|---|---|
| DINOv2 | 0.13 | 0.33 |
| SAM-DINOv2 | 0.45 | 0.58 |
| REN | 0.52 | 0.65 |
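For intuition, one plausible way to score a candidate image against a query object with region tokens is to keep the best-matching region, as in the short sketch below. This is an assumption about the retrieval protocol, not a description of the paper's exact evaluation.

```python
import torch
import torch.nn.functional as F

def region_retrieval_score(query_token, candidate_tokens):
    # query_token:      (D,) region token for the query object
    # candidate_tokens: (R, D) region tokens for one candidate image
    sims = F.cosine_similarity(candidate_tokens, query_token.unsqueeze(0), dim=-1)
    return sims.max().item()  # image-level score = best-matching region
```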
@inproceedings{khosla2025ren,
title={REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders},
author={Savya Khosla and Sethuraman T V and Barnett Lee and Alexander Schwing and Derek Hoiem},
booktitle={Neural Information Processing Systems},
year={2025}
}