Rohit Girdhar

Research Scientist

GenAI Research, Meta

I am a Research Scientist in the GenAI Research group at Meta. My current research focuses on understanding and generating multimodal data using minimal human supervision. I obtained an MS and PhD in Robotics from Carnegie Mellon University (here’s a link to my dissertation), where I worked on learning from and understanding videos. I was previously part of the Facebook AI Research (FAIR) group at Meta, and have spent time at DeepMind, Adobe, and Facebook as an intern. See here for a formal bio.

Education
  • PhD in Robotics, 2019

    Carnegie Mellon University, Pittsburgh, PA

  • MS in Robotics, 2016

    Carnegie Mellon University, Pittsburgh, PA

  • B. Tech. in Computer Science, 2014

    IIIT Hyderabad, India

Experience
  • Meta · Research Scientist

    New York · 2019 – Present

  • DeepMind · Research Scientist Intern

    London · Summer 2018

  • Facebook · Research Scientist Intern

    Menlo Park · Summer 2017

  • Adobe · Research Scientist Intern

    San Francisco · Summer 2016

  • Facebook · Software Engineering Intern

    Menlo Park · Summer 2013

Highlights

Videos powered by Emu Video!


Projects and Publications

The Llama 3 Herd of Models

State-of-the-art open-source LLM with multimodal capabilities.

InstanceDiffusion: Instance-level Control for Image Generation

SOTA instance-conditioned diffusion model for image generation.

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

SOTA unsupervised video segmentation using CutLER.

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

A simple and effective approach to high-quality video generation by learning to animate high-quality images.

ImageBind: One Embedding Space To Bind Them All

One embedding space for six different modalities, enabling zero-shot recognition across all of them!

The effectiveness of MAE pre-pretraining for billion-scale pretraining

Scaling up MAE pre-pretraining, followed by weakly supervised pretraining, leads to strong representations.

Learning Video Representations from Large Language Models

Leveraging LLMs to auto-annotate videos for representation learning.

OmniMAE: Single Model Masked Pretraining on Images and Videos

Single self-supervised representation for images and videos.

Omnivore: A Single Model for Many Visual Modalities

A single model for images, video and single-view 3D.

Detecting Twenty-thousand Classes using Image-level Supervision

Leverages image classification data to build an object detector.

Mask2Former for Video Instance Segmentation

SOTA video segmentation using Mask2Former.

Physical Reasoning Using Dynamics Aware Embeddings

Self-supervised representations for physical reasoning.

Forward Prediction for Physical Reasoning

Forward prediction for PHYRE benchmark.

Binge Watching: Scaling Affordance Learning from Sitcoms

Learning how humans interact with their environment by watching TV.