Representation | Rohit Girdhar

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.

Changan Chen, Ashutosh Kumar, Rohit Girdhar, David Harwath, Kristen Grauman

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

State-of-the-art unsupervised video instance segmentation using CutLER.

XuDong Wang, Ishan Misra, Ziyun Zeng, Rohit Girdhar, Trevor Darrell

The effectiveness of MAE pre-pretraining for billion-scale pretraining

Scaling up MAE pre-pretraining, followed by weakly supervised pretraining, leads to strong representations.

Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra

CutLER: Cut and Learn for Unsupervised Object Detection and Instance Segmentation

Discovers objects using DINO features and learns an unsupervised detection and segmentation model.

XuDong Wang, Rohit Girdhar, Stella X. Yu, Ishan Misra

OmniMAE: Single Model Masked Pretraining on Images and Videos

Single self-supervised representation for images and videos.

Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

Omnivore: A Single Model for Many Visual Modalities

A single model for images, video and single-view 3D.

Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens Van Der Maaten, Armand Joulin, Ishan Misra

Detecting Twenty-thousand Classes using Image-level Supervision

Leverages image classification data to build an object detector.

Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra

Self-Supervised Pretraining of 3D Features on any Point-Cloud

State-of-the-art 3D detection and segmentation results by learning contrastive representations on 3D data.

Zaiwei Zhang, Rohit Girdhar, Armand Joulin, Ishan Misra

DistInit: Learning Video Representations Without a Single Labeled Video

Distilling representations from image models to video models.

Rohit Girdhar, Du Tran, Lorenzo Torresani, Deva Ramanan
