Rohit Girdhar

Research Scientist

GenAI Research, Meta

I am a Research Scientist in the GenAI Research group at Meta. My current research focuses on understanding and generating multimodal data using minimal human supervision. I obtained an MS and PhD in Robotics from Carnegie Mellon University (here’s a link to my dissertation), where I worked on learning from and understanding videos. I was previously part of the Facebook AI Research (FAIR) group at Meta, and have spent time at DeepMind, Adobe, and Facebook as an intern. See here for a formal bio.

Education
  • PhD in Robotics, 2019

    Carnegie Mellon University, Pittsburgh, PA

  • MS in Robotics, 2016

    Carnegie Mellon University, Pittsburgh, PA

  • B. Tech. in Computer Science, 2014

    IIIT Hyderabad, India

Experience
  • Meta · Research Scientist

    New York · 2019 – Present

  • DeepMind · Research Scientist Intern

    London · Summer 2018

  • Facebook · Research Scientist Intern

    Menlo Park · Summer 2017

  • Adobe · Research Scientist Intern

    San Francisco · Summer 2016

  • Facebook · Software Engineering Intern

    Menlo Park · Summer 2013

Highlights

Videos powered by Emu Video!


Projects and Publications

The Llama 3 Herd of Models

State-of-the-art open-source LLM with multimodal capabilities.

InstanceDiffusion: Instance-level Control for Image Generation

SOTA instance-conditioned diffusion model for image generation.

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

SOTA unsupervised video segmentation using CutLER.

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

A simple and effective approach to high-quality video generation by learning to animate high-quality images.

ImageBind: One Embedding Space To Bind Them All

One embedding space for six different modalities, enabling zero-shot recognition across all of them!

The effectiveness of MAE pre-pretraining for billion-scale pretraining

Scaling up MAE pre-pretraining, followed by weakly supervised pretraining, leads to strong representations.

Learning Video Representations from Large Language Models

Leveraging LLMs to auto-annotate videos for representation learning.

OmniMAE: Single Model Masked Pretraining on Images and Videos

Single self-supervised representation for images and videos.

Omnivore: A Single Model for Many Visual Modalities

A single model for images, video and single-view 3D.

Detecting Twenty-thousand Classes using Image-level Supervision

Leverages image classification data to build an object detector.

Mask2Former for Video Instance Segmentation

SOTA video segmentation using Mask2Former.

Physical Reasoning Using Dynamics Aware Embeddings

Self-supervised representations for physical reasoning.

Forward Prediction for Physical Reasoning

Forward prediction for PHYRE benchmark.

Binge Watching: Scaling Affordance Learning from Sitcoms

Learning how humans interact with their environment by watching TV.