SMITE: Segment Me In TimE

Amirhossein Alimohammadi¹, Sauradip Nag¹, Saeid Asgari Taghanaki¹,², Andrea Tagliasacchi¹,³,⁴, Ghassan Hamarneh¹, Ali Mahdavi Amiri¹
  ¹Simon Fraser University   ²Autodesk Research   ³University of Toronto   ⁴Google DeepMind

TL;DR: Segment a subject in a video at any granularity, given reference annotations for only a few of its frames.




Abstract

Segmenting an object in a video presents significant challenges: each pixel must be accurately labeled, and the labels must remain consistent across frames. The difficulty increases when the segmentation is of arbitrary granularity, meaning the number of segments can vary arbitrarily and the masks are defined by only one or a few sample images. In this paper, we address this problem by employing a pre-trained text-to-image diffusion model supplemented with an additional tracking mechanism. We demonstrate that our approach effectively handles various segmentation scenarios and outperforms state-of-the-art alternatives.

Multi-Granularity Results

SMITE can segment at multiple granularities, from coarse to fine segments.


Challenging Scenarios

SMITE can segment in challenging scenarios such as camouflage, occlusion, and cut-scenes.

Each pair shows the Source Video next to SMITE (ours).

Methodology

In this work, we present SMITE, a video segmentation method that augments a pretrained text-to-image diffusion model with temporal attention to maintain consistency across video frames. We also introduce a temporal voting mechanism (illustrated in the paper) that tracks pixels over time and projects them onto the attention maps, ensuring that each pixel receives a consistent label. This approach significantly reduces flickering and noise compared to per-frame segmentation techniques while still adhering to the reference images. Finally, our low-pass regularization technique preserves the segment structure defined by the attention maps as they are optimized to align with the reference images.
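To make the voting idea concrete, below is a minimal PyTorch sketch. It assumes per-frame attention maps with one channel per segment label and integer pixel tracks from an off-the-shelf point tracker; the names (temporal_voting, attn_maps, tracks) and the simple sum-then-argmax vote are illustrative assumptions, not the released implementation.

import torch

def temporal_voting(attn_maps: torch.Tensor, tracks: torch.Tensor) -> torch.Tensor:
    """Assign one segment label per tracked point by voting across frames.

    attn_maps: (F, K, H, W) per-frame attention scores, one channel per label.
    tracks:    (N, F, 2) integer (y, x) positions of N tracked points per frame.
    Returns:   (N,) one label index per track, shared by every frame.
    """
    num_frames, num_labels, _, _ = attn_maps.shape
    votes = torch.zeros(tracks.shape[0], num_labels)
    for f in range(num_frames):
        ys, xs = tracks[:, f, 0], tracks[:, f, 1]
        # Scores of all K labels at each tracked pixel: (K, N) -> (N, K).
        votes += attn_maps[f][:, ys, xs].T
    return votes.argmax(dim=1)  # majority label per track

A soft variant could apply a softmax to each frame's scores before accumulating, down-weighting frames where the attention is diffuse.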


Qualitative Results

Three example videos; each compares, left to right: Source Video, Baseline-I, Grounded SAM 2, and SMITE (ours).

Comparison with XMem++

Using a Single Reference Image on a Long Video (Over 1,000 Frames, Video Speed Increased by 4x)


Source Image · Source Annotation · XMem++ · SMITE (ours)

SMITE-50 Dataset and Benchmark

SMITE-50 is a video dataset designed for challenging segmentation tasks involving multiple object parts under difficult conditions such as occlusion. It consists of 50 videos, each up to 20 seconds long, with frame counts ranging from 24 to 400 and both vertical and horizontal aspect ratios. The dataset covers four main classes: “Horses,” “Faces,” “Cars,” and “Non-Text.” Videos in the “Horses” and “Cars” categories, captured outdoors, present challenges such as occlusion, viewpoint changes, and fast-moving objects against dynamic backgrounds, while “Faces” involves occlusion, scale changes, and fine-grained parts that are difficult to track and segment over time. The “Non-Text” category comprises nine videos whose parts cannot be described in natural language, making them challenging for zero-shot video segmentation models that rely on textual vocabularies.

Primarily sourced from Pexels, SMITE-50 provides multi-granularity annotations for the 41 videos in the “Horses,” “Faces,” and “Cars” subsets. Each subset includes ten segmented reference images for training and densely annotated videos for testing, with granularity ranging from human eyes to animal heads, relevant for applications such as VFX. Annotations are dense: masks are provided for every fifth frame, with an average of six parts per frame across three granularity types. Whereas PumaVOS offers 8% dense annotations, SMITE-50 provides 20%. Although still a work in progress, SMITE-50 will be made publicly available.
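Since the dataset has not yet been released, the loader below is only a sketch: the directory layout, file names, and the iter_annotated_pairs helper are all hypothetical, chosen to match the description above (four category folders, masks for every fifth frame).

from pathlib import Path

# Assumed layout (hypothetical until the dataset is released):
# SMITE-50/
#   horses/  faces/  cars/  non_text/
#     <video_id>/
#       frames/00000.jpg 00001.jpg ...
#       masks/00000.png  00005.png ...   <- every 5th frame annotated

def iter_annotated_pairs(root="SMITE-50", category="horses"):
    """Yield (frame_path, mask_path) pairs for a category's annotated frames."""
    for video_dir in sorted(Path(root, category).iterdir()):
        for mask in sorted((video_dir / "masks").glob("*.png")):
            frame = video_dir / "frames" / f"{mask.stem}.jpg"
            if frame.exists():
                yield frame, mask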

BibTeX

@misc{alimohammadi2024smitesegmenttime,
  title={SMITE: Segment Me In TimE},
  author={Amirhossein Alimohammadi and Sauradip Nag and Saeid Asgari Taghanaki and Andrea Tagliasacchi and Ghassan Hamarneh and Ali Mahdavi Amiri},
  year={2024},
  eprint={2410.18538},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.18538},
}