Multimodal | Rohit Girdhar

Human detectors are surprisingly powerful reward models

Using human detection confidence as a simple yet effective reward model to improve human motion in video generation.

Kumar Ashutosh, Xudong Wang, Xi Yin, Kristen Grauman, Adam Polyak, Ishan Misra, Rohit Girdhar

LLMs can see and hear without any training

Pure text-only LLMs can use off-the-shelf multimodal embedding models to do various multimodal tasks!

Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar

Movie Gen: A Cast of Media Foundation Models

State-of-the-Art Video (+Audio) Generation Model

Movie Gen Team (Core Contributor)

The Llama 3 Herd of Models

State-of-the-Art open-source LLM with multimodal capabilities

Llama 3 Team (Co-Led the Video Recognition Efforts)

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.

Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

Generating Illustrated Instructions

Introducing a new task: generating illustrated instructions for the task you want to solve, along with an LLM + diffusion model based solution.

Sachit Menon, Ishan Misra, Rohit Girdhar

Motion-Conditioned Image Animation for Video Editing

Image animation FTW again! SOTA video editing results by animating the first frame with motion conditioning.

Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, Samaneh Azadi

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

A simple and effective approach to high-quality video generation by learning to animate high-quality images.

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra

ImageBind: One Embedding Space To Bind Them All

One embedding space for six different modalities, enabling zero-shot recognition on all of them!

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

HierVL: Learning Hierarchical Video-Language Embeddings

Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings.

Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman

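To make the two-level idea in the HierVL abstract concrete, here is a minimal sketch of a hierarchical contrastive objective. It is an illustration under stated assumptions, not the paper's implementation: the plain InfoNCE loss, the mean pooling of clip embeddings into a video embedding, the `video_weight` parameter, and all function names are hypothetical choices made for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Contrastive loss where queries[i] should match keys[i].

    Assumption: a plain InfoNCE over cosine similarities stands in for
    whatever contrastive loss the paper actually uses.
    """
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def mean_pool(vectors):
    """Assumption: average clip embeddings to get a video embedding."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def hierarchical_loss(clip_embs, narration_embs, summary_embs, video_weight=1.0):
    """Clip-level plus video-level contrastive loss.

    clip_embs:      per video, a list of clip vectors
    narration_embs: per video, a list of per-clip text vectors
    summary_embs:   one summary text vector per video
    """
    # Clip level: each clip should match its own narration ("what").
    flat_clips = [c for video in clip_embs for c in video]
    flat_text = [t for video in narration_embs for t in video]
    clip_loss = info_nce(flat_clips, flat_text)

    # Video level: the pooled video should match its summary ("why").
    video_embs = [mean_pool(video) for video in clip_embs]
    video_loss = info_nce(video_embs, summary_embs)

    return clip_loss + video_weight * video_loss


# Toy data: two videos with two clips each, 3-dimensional embeddings.
clips = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
         [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]]
narrations = clips  # perfectly aligned per-clip text
summaries = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]

loss_aligned = hierarchical_loss(clips, narrations, summaries)
loss_swapped = hierarchical_loss(clips, narrations, summaries[::-1])
```

In this toy setup, swapping the two summaries leaves the clip-level term unchanged but raises the video-level term, so `loss_swapped` is larger than `loss_aligned`. That is the behavior the video-level constraint is meant to provide.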