Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
A simple and effective approach to high-quality video generation by learning to animate high-quality images.
The effectiveness of MAE pre-pretraining for billion-scale pretraining
Scaling up MAE pre-pretraining, followed by weakly supervised pretraining, leads to strong representations.
Omnivore: A Single Model for Many Visual Modalities
A single model for images, video and single-view 3D.
Mask2Former for Video Instance Segmentation
State-of-the-art video instance segmentation using Mask2Former.
Forward Prediction for Physical Reasoning
Forward prediction for the PHYRE benchmark.