Rohit Girdhar

Research Scientist

GenAI Research, Meta

I am a Research Scientist in the GenAI Research group at Meta. My current research focuses on understanding and generating multimodal data with minimal human supervision. I obtained an MS and PhD in Robotics from Carnegie Mellon University (here’s a link to my dissertation), where I worked on learning from and understanding videos. I was previously part of the Facebook AI Research (FAIR) group at Meta, and have spent time at DeepMind, Adobe, and Facebook as an intern. See here for a formal bio.

Education
  • PhD in Robotics, 2019

    Carnegie Mellon University, Pittsburgh PA

  • MS in Robotics, 2016

    Carnegie Mellon University, Pittsburgh PA

  • B. Tech. in Computer Science, 2014

    IIIT Hyderabad, India

Experience
  • Meta · Research Scientist

    New York · 2019 -- Present

  • DeepMind · Research Scientist Intern

    London · Summer 2018

  • Facebook · Research Scientist Intern

    Menlo Park · Summer 2017

  • Adobe · Research Scientist Intern

    San Francisco · Summer 2016

  • Facebook · Software Engineering Intern

    Menlo Park · Summer 2013

Highlights

Videos powered by Emu Video!


Projects and Publications

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

SOTA unsupervised video segmentation using CutLER.

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

A simple and effective approach to high-quality video generation by learning to animate high-quality images.

ImageBind: One Embedding Space To Bind Them All

One embedding space for six different modalities, enabling zero-shot recognition across all of them!

The effectiveness of MAE pre-pretraining for billion-scale pretraining

Scaling up MAE pre-pretraining, followed by weakly supervised pretraining, leads to strong representations.

Learning Video Representations from Large Language Models

Leveraging LLMs to auto-annotate videos for representation learning.

OmniMAE: Single Model Masked Pretraining on Images and Videos

Single self-supervised representation for images and videos.

Omnivore: A Single Model for Many Visual Modalities

A single model for images, video and single-view 3D.

Detecting Twenty-thousand Classes using Image-level Supervision

Leverages image classification data to build an object detector.

Mask2Former for Video Instance Segmentation

SOTA video segmentation using Mask2Former.

Masked-attention Mask Transformer for Universal Image Segmentation

A single architecture achieving state-of-the-art results in instance, semantic, and panoptic segmentation.

3DETR: An End-to-End Transformer Model for 3D Object Detection

The first Transformer-based detection architecture for 3D data.

Physical Reasoning Using Dynamics Aware Embeddings

Self-supervised representations for physical reasoning.

Forward Prediction for Physical Reasoning

Forward prediction for the PHYRE benchmark.

Video Action Transformer Network

Among the first applications of Transformers to modeling videos. SOTA results: a close second in the AVA Challenge at CVPR'18.

ActionVLAD: Learning spatio-temporal aggregation for action classification

Aggregating visual features for action recognition.

Binge Watching: Scaling Affordance Learning from Sitcoms

Learning how humans interact with their environment by watching TV.

Learning a Predictable and Generative Vector Representation for Objects

A single embedding space that works well for both generating and understanding 3D models.