Rohit Girdhar
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.
Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman
PDF
Cite
InstanceDiffusion: Instance-level Control for Image Generation
SOTA instance-conditioned diffusion model for image generation.
Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, Ishan Misra
PDF
Cite
Code
Generating Illustrated Instructions
Introducing the new task of generating illustrated instructions for a task you want to solve, along with a solution based on an LLM and a diffusion model.
Sachit Menon, Ishan Misra, Rohit Girdhar
PDF
Cite
Code
VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation
SOTA unsupervised video segmentation using CutLER.
Xudong Wang, Ishan Misra, Ziyun Zeng, Rohit Girdhar, Trevor Darrell
PDF
Cite
Code
ImageBind: One Embedding Space To Bind Them All
One embedding space for 6 different modalities, enabling zero-shot recognition on all of them!
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
PDF
Cite
Video
Code
The effectiveness of MAE pre-pretraining for billion-scale pretraining
Scaling up MAE pre-pretraining, followed by weakly supervised pretraining, leads to strong representations.
Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra
PDF
Cite
Code
CutLER: Cut and Learn for Unsupervised Object Detection and Instance Segmentation
Discovering objects using DINO features, and learning an unsupervised detection + segmentation model.
Xudong Wang, Rohit Girdhar, Stella X. Yu, Ishan Misra
PDF
Cite
Code
HierVL: Learning Hierarchical Video-Language Embeddings
Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
PDF
Cite
Video
Code
Learning Video Representations from Large Language Models
Leveraging LLMs to auto-annotate videos for representation learning.
Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar
PDF
Cite
Colab
Code
OmniMAE: Single Model Masked Pretraining on Images and Videos
Single self-supervised representation for images and videos.
Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
PDF
Cite
Video
Code