This technical report focuses on (1) how to transform any type of visual data into a unified representation that enables large-scale training of generative models and (2) a qualitative assessment of Sora’s capabilities and limitations. Model and implementation details are not included in this report.
Previous research has studied generative modeling of video data using a variety of methods, including recurrent networks,[^1][^2][^3] generative adversarial networks,[^4][^5][^6][^7] autoregressive transformers,[^8][^9] and diffusion models.[^10][^11][^12] These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data: it can generate videos and images spanning diverse durations, aspect ratios, and resolutions, up to a full minute of high-definition video.