Video | Rohit Girdhar

Human detectors are surprisingly powerful reward models

Using human detection confidence as a simple yet effective reward model to improve human motion in video generation.

Kumar Ashutosh, Xudong Wang, Xi Yin, Kristen Grauman, Adam Polyak, Ishan Misra, Rohit Girdhar

MotiF: Making Text Count in Image Animation with Motion Focal Loss

Using flow to improve motion in video generation

Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, Xi Yin

Movie Gen: A Cast of Media Foundation Models

State-of-the-Art Video (+Audio) Generation Model

Movie Gen Team (core contributor)

The Llama 3 Herd of Models

State-of-the-Art open-source LLM with multimodal capabilities

Llama 3 Team (co-led the video recognition efforts)

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.

Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

Motion-Conditioned Image Animation for Video Editing

Image animation FTW again! SOTA video editing results by animating the first frame with motion conditioning.

Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, Samaneh Azadi

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

SOTA unsupervised video segmentation using CutLER.

Xudong Wang, Ishan Misra, Ziyun Zeng, Rohit Girdhar, Trevor Darrell

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

A simple and effective approach to high-quality video generation by learning to animate high-quality images.

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra

ImageBind: One Embedding Space To Bind Them All

One embedding space for six different modalities, enabling zero-shot recognition across all of them!

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

The effectiveness of MAE pre-pretraining for billion-scale pretraining

Scaling up MAE pre-pretraining, followed by weakly supervised pretraining, leads to strong representations.

Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra
