Core Concept
Enhancing CLIP features for open-vocabulary semantic segmentation without annotations.
Summary
The article introduces CLIP-DINOiser, a method that improves MaskCLIP features for open-vocabulary semantic segmentation without requiring annotations. It uses localization cues from self-supervised DINO features to refine CLIP's dense features. The method is reported to achieve state-of-the-art performance on challenging datasets such as COCO, Pascal Context, Cityscapes, and ADE20k. The approach trains light convolutional layers that refine MaskCLIP features and improve segmentation quality.
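To make "open-vocabulary segmentation from CLIP features" concrete, here is a minimal NumPy sketch of the generic labeling step: each patch feature is compared with one text embedding per class name, and the most similar class wins. This is an illustration under stated assumptions, not the authors' code. The function names `l2_normalize` and `segment_patches`, the array shapes, and the random toy data are all hypothetical; real inputs would come from CLIP's image and text encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def segment_patches(patch_feats, text_embeds):
    """Assign each patch the class whose text embedding is most similar.

    patch_feats: (H, W, D) dense image features (hypothetical stand-in for MaskCLIP output)
    text_embeds: (K, D) one embedding per class prompt (hypothetical stand-in for CLIP text features)
    returns:     (H, W) integer class indices
    """
    sims = l2_normalize(patch_feats) @ l2_normalize(text_embeds).T  # (H, W, K)
    return sims.argmax(axis=-1)

# Toy data: 3 classes, a 4x6 patch grid, 16-dimensional features.
rng = np.random.default_rng(0)
text_embeds = rng.normal(size=(3, 16))

# Left half of the grid resembles class 0, right half resembles class 2.
patch_feats = np.empty((4, 6, 16))
patch_feats[:, :3] = text_embeds[0]
patch_feats[:, 3:] = text_embeds[2]
patch_feats += 0.05 * rng.normal(size=patch_feats.shape)  # small per-patch noise

labels = segment_patches(patch_feats, text_embeds)
```

Because the class list is just a set of text embeddings, changing the vocabulary only means changing `text_embeds`; no segmentation labels are involved at any point.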
Structure:
- Introduction
  - Semantic segmentation importance in real-world systems.
  - Shift from closed-vocabulary to open-world models.
- Related Work
  - Approaches for zero-shot semantic segmentation.
  - Challenges in extending CLIP to open-vocabulary segmentation.
- Method
  - CLIP-DINOiser strategy to improve MaskCLIP features.
  - Leveraging self-supervised DINO features for localization.
- Experiments
  - Experimental setup details and datasets used.
  - Comparison with state-of-the-art methods.
- Conclusions
  - CLIP-DINOiser's success in open-vocabulary semantic segmentation.
Statistics
Our method CLIP-DINOiser achieves state-of-the-art results on challenging datasets like COCO, Pascal Context, Cityscapes, and ADE20k.
The approach involves training light convolutional layers to refine MaskCLIP features and improve segmentation quality.
At inference, the method requires only a single forward pass of the CLIP model and two light convolutional layers.
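The summary does not say how the outputs of the convolutional layers are used. One plausible reading, sketched below purely as an assumption, is that they provide per-patch guidance features whose pairwise similarities decide which patches should share information, so that noisy patch features are smoothed within an object but not across its boundary. The function name `guided_pooling`, the `threshold` parameter, and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def guided_pooling(clip_feats, guide_feats, threshold=0.2):
    """Smooth patch features by averaging over patches with similar guidance features.

    clip_feats:  (N, D) noisy per-patch features to refine
    guide_feats: (N, C) non-zero per-patch guidance features (assumed to carry localization cues)
    threshold:   affinities at or below this value are ignored (hypothetical setting)
    returns:     (N, D) refined features
    """
    g = guide_feats / (np.linalg.norm(guide_feats, axis=1, keepdims=True) + 1e-8)
    affinity = g @ g.T                                   # (N, N) cosine similarities
    weights = np.where(affinity > threshold, affinity, 0.0)
    # Each patch is similar to itself, so every row has a positive sum.
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ clip_feats                          # weighted average per patch

# Toy data: 8 patches forming two objects of 4 patches each.
rng = np.random.default_rng(0)
guide_feats = np.zeros((8, 2))
guide_feats[:4, 0] = 1.0   # patches 0-3 belong to object A
guide_feats[4:, 1] = 1.0   # patches 4-7 belong to object B

clip_feats = np.concatenate([
    np.full((4, 5), 1.0),   # object A prototype
    np.full((4, 5), -1.0),  # object B prototype
]) + 0.3 * rng.normal(size=(8, 5))

refined = guided_pooling(clip_feats, guide_feats)
```

In this toy case the guidance features of the two objects are orthogonal, so each refined patch becomes the mean of its own object's patches and the two objects do not mix. With real features the affinities would be softer, and the threshold would control how aggressively distant patches are excluded.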
Quotes
"Our method greatly improves the performance of MaskCLIP and produces smooth outputs."
"CLIP-DINOiser reaches state-of-the-art results on challenging and fine-grained benchmarks."