Paper-Journal

Human detectors are surprisingly powerful reward models

Using human detection confidence as a simple yet effective reward model to improve human motion in video generation.

Kumar Ashutosh, Xudong Wang, Xi Yin, Kristen Grauman, Adam Polyak, Ishan Misra, Rohit Girdhar

Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective

Introducing FreqWarm, a plug-and-play frequency warm-up curriculum that improves high-dimensional latent diffusion by increasing early-stage exposure to high-frequency signals.

Bolin Lai, Xudong Wang, Saketh Rambhatla, James M. Rehg, Zsolt Kira, Rohit Girdhar, Ishan Misra

Diffusion Autoencoders are Scalable Image Tokenizers

Simplified image tokenization using diffusion

Yinbo Chen, Rohit Girdhar, Xiaolong Wang, Sai Saketh Rambhatla, Ishan Misra

LLMs can see and hear without any training

Pure text-only LLMs can use off-the-shelf multimodal embedding models to do various multimodal tasks!

Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar

MotiF: Making Text Count in Image Animation with Motion Focal Loss

Using flow to improve motion in video generation

Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, Xi Yin

Movie Gen: A Cast of Media Foundation Models

State-of-the-Art Video (+Audio) Generation Model

Movie Gen Team (Core Contributor)

The Llama 3 Herd of Models

State-of-the-Art open-source LLM with multimodal capabilities

Llama 3 Team (co-led the video recognition efforts)

Motion-Conditioned Image Animation for Video Editing

Image animation FTW again! SOTA video editing results by animating the first frame with motion conditioning.

Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, Samaneh Azadi

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

A simple and effective approach to high-quality video generation by learning to animate high-quality images.

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra

Mask2Former for Video Instance Segmentation

SOTA video segmentation using Mask2Former.

Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, Alexander G. Schwing
