Video generation models as world simulators
This report summarizes a method for turning visual data into a unified representation that enables large-scale training of generative models, and evaluates the capabilities and limitations of a model named Sora. Unlike previous works that often focus on a specific type of visual data, Sora is a generalist model capable of generating videos and images of varying durations, aspect ratios, and resolutions.
Inspired by large language models (LLMs), Sora represents visual data as patches, analogous to the text tokens used by LLMs. This approach has proven effective for training generative models on diverse videos and images. Sora first applies a video compression network that reduces raw video to a lower-dimensional latent space; the resulting latent representation is then decomposed into spacetime patches that act as transformer tokens.
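To make this concrete, the sketch below shows one way a compressed video latent could be cut into spacetime patch tokens. The tensor layout, patch sizes, and the `patchify` helper are illustrative assumptions, not Sora's actual implementation.

```python
# A minimal sketch of turning a compressed video latent into spacetime patch
# tokens, in the spirit of the description above. Shapes and patch sizes are
# assumptions chosen for illustration.
import torch

def patchify(latent: torch.Tensor, pt: int = 2, ph: int = 4, pw: int = 4) -> torch.Tensor:
    """Split a latent video of shape (C, T, H, W) into flattened spacetime patches.

    Returns a token sequence of shape (num_patches, pt*ph*pw*C), which a
    transformer can consume regardless of the video's original duration,
    resolution, or aspect ratio.
    """
    C, T, H, W = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "pad the latent to patch multiples"
    x = latent.reshape(C, T // pt, pt, H // ph, ph, W // pw, pw)
    # Put the (t, h, w) patch grid first, then flatten each patch's contents.
    x = x.permute(1, 3, 5, 2, 4, 6, 0)           # (T/pt, H/ph, W/pw, pt, ph, pw, C)
    return x.reshape(-1, pt * ph * pw * C)       # (num_patches, patch_dim)

# Example: a 16-frame, 32x48 latent with 8 channels becomes 768 tokens of size 256.
tokens = patchify(torch.randn(8, 16, 32, 48))
print(tokens.shape)  # torch.Size([768, 256])
```

Because the number of tokens simply tracks the latent's size, videos of different resolutions and lengths can be trained on and sampled without cropping them to a fixed shape.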
Sora is a diffusion transformer, a class of model that has shown remarkable scaling properties across domains. Given noisy input patches (and conditioning information such as text), it is trained to predict the original "clean" patches, and sample quality improves markedly as training compute increases. Training on data at its native size provides sampling flexibility and improves framing and composition, while highly descriptive captions for the training videos improve language understanding and prompt following.
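The training objective described here can be sketched as follows: a transformer receives noise-corrupted patch tokens and is trained with a regression loss to recover the clean patches. The toy backbone, the simple noising scheme, and all hyperparameters are assumptions for illustration; conditioning on the noise level and on text, which a real diffusion transformer would use, is omitted for brevity.

```python
# A minimal sketch of a diffusion-transformer training step: predict clean
# patches from noisy ones. Not Sora's actual architecture or noise schedule.
import torch
import torch.nn as nn

class ToyDiffusionTransformer(nn.Module):
    def __init__(self, patch_dim: int = 256, d_model: int = 512, depth: int = 4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.unembed = nn.Linear(d_model, patch_dim)

    def forward(self, noisy_patches: torch.Tensor) -> torch.Tensor:
        # Predict the clean patches directly from their noisy versions.
        return self.unembed(self.backbone(self.embed(noisy_patches)))

model = ToyDiffusionTransformer()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

clean = torch.randn(2, 768, 256)                  # (batch, num_patches, patch_dim)
t = torch.rand(2, 1, 1)                           # random noise level per sample
noisy = (1 - t) * clean + t * torch.randn_like(clean)

opt.zero_grad()
loss = nn.functional.mse_loss(model(noisy), clean)
loss.backward()
opt.step()
print(loss.item())
```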
Sora can also be prompted with images or videos, enabling a wide range of editing tasks: it can animate static images, extend videos forwards or backwards in time, and transform the style or setting of an input video. The model can generate images at resolutions up to 2048x2048 and exhibits emergent simulation capabilities such as 3D consistency, long-range coherence and object permanence, and simple actions that affect the state of the world.
Despite these advances, Sora has limitations: it does not accurately model the physics of many basic interactions, such as glass shattering. Even so, the demonstrated capabilities suggest that continued scaling of video models is a promising path toward simulators of the physical and digital world.
The original article: https://openai.com/research/video-generation-models-as-world-simulators