Selected

LLMs can see and hear without any training

Text-only LLMs can use off-the-shelf multimodal embedding models to perform a variety of multimodal tasks!

Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar

Movie Gen: A Cast of Media Foundation Models

State-of-the-Art Video (+Audio) Generation Model

Movie Gen Team (Core Contributor)

The Llama 3 Herd of Models

State-of-the-Art open-source LLM with multimodal capabilities

Llama 3 Team (Co-Led the Video Recognition Efforts)

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

A simple and effective approach to high-quality video generation by learning to animate high-quality images.

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra

ImageBind: One Embedding Space To Bind Them All

One embedding space for six different modalities, enabling zero-shot recognition on all of them!

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

Learning Video Representations from Large Language Models

Leveraging LLMs to auto-annotate videos for representation learning.

Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar

Omnivore: A Single Model for Many Visual Modalities

A single model for images, video and single-view 3D.

Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra

Ego4D: Around the World in 3,000 Hours of Egocentric Video

The largest egocentric video dataset.

Kristen Grauman, Andrew Westbury, Rohit Girdhar, et al.

Masked-attention Mask Transformer for Universal Image Segmentation

A single architecture that achieves state-of-the-art results in instance, semantic, and panoptic segmentation.

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar

Anticipative Video Transformer

An autoregressive video transformer architecture for action anticipation in videos.

Rohit Girdhar, Kristen Grauman
