Referring Video Object Segmentation (RVOS) aims to segment and track objects in videos based on natural language expressions, requiring precise alignment between visual content and textual queries. However, existing methods often suffer from semantic misalignment, largely due to indiscriminate frame sampling and supervision of all visible objects during training, regardless of their actual relevance to the expression. To address this, we introduce a moment-aware RVOS framework named SAMDWICH, along with a newly annotated dataset, MeViS-M, built upon the challenging MeViS benchmark. We manually annotate temporal moments indicating when each object is referred to by the expression, enabling semantically grounded supervision that strengthens video-text alignment. SAMDWICH leverages these aligned text-to-clip pairs to guide training, significantly enhancing referential understanding. Building on this framework, we propose Moment-guided Dual-path Propagation (MDP), a moment-aware propagation strategy that improves both object grounding and tracking by training on relevant and irrelevant frames alike through a moment-centric memory mechanism. In addition, we introduce Object-level Selective Supervision (OSS), an object-level filtering strategy that supervises only the objects temporally aligned with the expression in each training clip. This selective supervision reduces semantic noise and reinforces language-conditioned learning. Extensive experiments show that SAMDWICH achieves state-of-the-art performance on the MeViS benchmark, particularly excelling in complex scenarios involving diverse expressions.
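To make the OSS idea concrete, the following is a minimal sketch, assuming per-object moments are available as (start, end) frame spans from MeViS-M; the function signature, tensor shapes, and `seg_loss_fn` interface are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of Object-level Selective Supervision (OSS).
# Assumption: each object comes with a (start, end) moment span marking when
# the expression refers to it; only objects whose moment overlaps the sampled
# training clip contribute to the segmentation loss.
import torch

def oss_loss(pred_masks, gt_masks, clip_frames, moments, seg_loss_fn):
    """pred_masks, gt_masks: (num_objects, T, H, W) tensors for one clip;
    clip_frames: list of T frame indices sampled for this training clip;
    moments: list of (start, end) spans, one per annotated object;
    seg_loss_fn: any per-object segmentation loss (placeholder)."""
    losses = []
    for obj_idx, (start, end) in enumerate(moments):
        # Supervise an object only if the clip overlaps its annotated moment,
        # i.e., the expression actually refers to it in this clip.
        if any(start <= f <= end for f in clip_frames):
            losses.append(seg_loss_fn(pred_masks[obj_idx], gt_masks[obj_idx]))
    if not losses:
        # No referred object in this clip: return a zero loss (no supervision).
        return pred_masks.sum() * 0.0
    return torch.stack(losses).mean()
```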
Importance of a moment-aware approach. (a) Most existing methods rely on random frame sampling, leading to unnatural learning dynamics where models are forced to segment referred objects even in frames unrelated to the given text. (b) We propose a novel RVOS pipeline, SAMDWICH, that explicitly focuses on text-relevant moments to enable semantically grounded segmentation.
MeViS-M dataset and analysis on the validation set. (a) A moment annotation example from MeViS-M, showing temporal spans labeled for each object referred to by the given expression. (b) Comparison of top-1 keyframe selection accuracy across VLMs. The consistently low accuracy across all models underscores the limitation of existing VLMs in moment retrieval and highlights the necessity of fine-grained moment annotations.
Overall pipeline. (a) The features \( \mathbf{F}_{\text{Adp}} \) of text-relevant frames are used for mask generation and memory updates, whereas text-irrelevant frames use \( \mathbf{F}_{\text{SAM}} \) for mask generation without contributing to the memory update. (b) illustrates how \( \mathbf{F}_{\text{Adp}} \) and \( \mathbf{F}_{\text{SAM}} \) are extracted from relevant and irrelevant frames, respectively, and how visual features are integrated into the prompt.
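The dual-path routing described in (a) can be sketched as follows; this is a simplified illustration assuming a SAM2-style memory bank, where `encode_with_adapter`, `encode_sam`, `mask_decoder`, and `memory` are hypothetical placeholders standing in for the actual modules.

```python
# Minimal sketch of Moment-guided Dual-path Propagation (MDP).
# Assumption: `relevant` flags frames inside the annotated moment; the memory
# bank exposes read()/write() in the style of SAM2-like trackers.
def mdp_forward(frames, relevant, encode_with_adapter, encode_sam,
                mask_decoder, memory):
    """frames: list of video frames in temporal order;
    relevant: list of booleans marking text-relevant frames."""
    masks = []
    for frame, is_relevant in zip(frames, relevant):
        if is_relevant:
            # Text-relevant frame: text-conditioned features F_Adp drive mask
            # generation and are written to the memory (moment-centric update).
            feat = encode_with_adapter(frame)        # F_Adp
            mask = mask_decoder(feat, memory.read())
            memory.write(feat, mask)
        else:
            # Text-irrelevant frame: plain SAM features F_SAM are used only for
            # mask propagation; the memory bank is left untouched.
            feat = encode_sam(frame)                 # F_SAM
            mask = mask_decoder(feat, memory.read())
        masks.append(mask)
    return masks
```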
Comparison on the MeViS dataset. Oracle uses ground-truth moments from MeViS-M at inference. † indicates methods that leverage Vision-Language Models (VLMs) for keyframe selection. Oracle + Ours† uses VLMs to extract keyframes from ground-truth moments. We adopt Chrono (Meinardus et al. 2024) and BLIP-2 (Li et al. 2023) as keyframe selectors.
PCA-based feature maps and segmentation results of SAMWISE & SAMDWICH.