Paper-Conference | Rohit Girdhar

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.

Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

InstanceDiffusion: Instance-level Control for Image Generation

SOTA instance-conditioned diffusion model for image generation.

Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, Ishan Misra

Generating Illustrated Instructions

Introducing a new task: generating illustrated instructions for the task you want to solve, along with a solution based on an LLM and a diffusion model.

Sachit Menon, Ishan Misra, Rohit Girdhar

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

SOTA unsupervised video segmentation using CutLER.

Xudong Wang, Ishan Misra, Ziyun Zeng, Rohit Girdhar, Trevor Darrell

ImageBind: One Embedding Space To Bind Them All

One embedding space for six different modalities, enabling zero-shot recognition on all of them!

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

The effectiveness of MAE pre-pretraining for billion-scale pretraining

Scaling up MAE pre-pretraining, followed by weakly supervised pretraining, leads to strong representations.

Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra

CutLER: Cut and Learn for Unsupervised Object Detection and Instance Segmentation

Discovering objects using DINO features and learning an unsupervised detection and segmentation model.

Xudong Wang, Rohit Girdhar, Stella X. Yu, Ishan Misra

HierVL: Learning Hierarchical Video-Language Embeddings

Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings.

Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
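
The hierarchical contrastive objective described in the HierVL abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a symmetric InfoNCE loss at both levels, mean pooling of clip embeddings to form the video-level representation, a temperature of 0.07, and an equal weighting of the two terms. All function and parameter names (`info_nce`, `mean_pool`, `hierarchical_loss`, `weight`) are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def sim_matrix(visual, text):
    # Entry [i][j] is the similarity of visual item i to text item j.
    return [[cosine(v, t) for t in text] for v in visual]

def info_nce(sim, temperature=0.07):
    # Symmetric InfoNCE over a square similarity matrix.
    # Matched (visual, text) pairs are assumed to lie on the diagonal.
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / n

    cols = [list(c) for c in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(cols))

def mean_pool(vectors):
    # Assumed aggregator: average the clip embeddings of one video.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def hierarchical_loss(clip_emb, narration_emb, summary_emb, weight=1.0):
    # clip_emb[v][c]      : visual embedding of clip c in video v
    # narration_emb[v][c] : text embedding of the narration for that clip
    # summary_emb[v]      : text embedding of the video-level summary
    flat_clips = [c for video in clip_emb for c in video]
    flat_narrations = [t for video in narration_emb for t in video]
    clip_loss = info_nce(sim_matrix(flat_clips, flat_narrations))

    video_emb = [mean_pool(video) for video in clip_emb]
    video_loss = info_nce(sim_matrix(video_emb, summary_emb))

    # Assumed combination: a weighted sum of clip-level and video-level terms.
    return clip_loss + weight * video_loss
```

In this sketch, the clip-level term pulls each clip toward its own narration, while the video-level term pulls the pooled video embedding toward its summary. Swapping the summaries between videos should raise the loss, since only the video-level term changes.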

Learning Video Representations from Large Language Models

Leveraging LLMs to auto-annotate videos for representation learning.

Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar

OmniMAE: Single Model Masked Pretraining on Images and Videos

Single self-supervised representation for images and videos.

Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra