
An Introduction to Sora

By AiBard123
February 23, 2024 - 2 min read

Author: AINLP  Source: AINLP

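Sora, released by OpenAI, was the most talked-about model over the Spring Festival holiday. Because the technical details have not been made public and access is currently limited to a small group of test users, this article only covers what the official technical report discloses.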

https://openai.com/research/video-generation-models-as-world-simulators

Video generation models as world simulators

❝

Sora is a text-to-video model, referred to here as a world simulator.

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.

❝

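This part is the overview. From it we learn the following:

Sora is a diffusion model

Sora uses spacetime patches

Sora can generate one minute of high-quality video

The authors see scaling video generation models as a promising path toward general-purpose simulators of the physical world; beyond text-to-video, the longer-term goal is to model the real world.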

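The technical report focuses on:

(1) a method for turning visual data of all types into a unified representation that enables large-scale training of generative models;
(2) a qualitative evaluation of Sora's capabilities and limitations. Model and implementation details are not included in the report.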

prior work

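Prior work on generative modeling of video data includes:

recurrent networks
generative adversarial networks
autoregressive transformers
diffusion models

These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size.

Sora is a generalist model of visual data: it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high-definition video.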

overview

Turning visual data into patches

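We take inspiration from large language models (LLMs), which acquire generalist capabilities by training on internet-scale data. The success of the LLM paradigm is due in part to tokens, which elegantly unify diverse modalities of text: code, math and various natural languages.

In this work, we consider how generative models of visual data can inherit these benefits. Where LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data. We find that patches are a highly scalable and effective representation for training generative models on diverse types of videos and images.

At a high level, a video is turned into patches by first compressing it into a lower-dimensional latent space and then decomposing that representation into spacetime patches.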

Video compression network

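We train a network that reduces the dimensionality of visual data. It takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on, and subsequently generates videos within, this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.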

Spacetime latent patches

Given a compressed input video, we extract a sequence of spacetime patches that act as transformer tokens. The scheme works for images too, since images are just videos with a single frame.

Our patch-based representation lets Sora train on videos and images of variable resolutions, durations and aspect ratios.
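To make the idea of spacetime patches concrete, here is a minimal sketch in plain Python. It is not Sora's implementation: the report does not disclose the patch size, the latent shape or the channel layout, so the `patchify` helper, the `[frames][height][width]` layout (channels omitted) and the patch sizes below are all assumptions chosen for illustration.

```python
from typing import List

# Assumed layout: latent[frame][row][col], one scalar per position (channels omitted).
Latent = List[List[List[float]]]


def patchify(latent: Latent, pt: int, ph: int, pw: int) -> List[List[float]]:
    """Split a latent into non-overlapping spacetime patches.

    Each patch spans pt frames x ph rows x pw columns and is flattened
    into one vector, which would serve as one transformer token.
    """
    T, H, W = len(latent), len(latent[0]), len(latent[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    tokens = []
    for t0 in range(0, T, pt):
        for h0 in range(0, H, ph):
            for w0 in range(0, W, pw):
                tokens.append([
                    latent[t][h][w]
                    for t in range(t0, t0 + pt)
                    for h in range(h0, h0 + ph)
                    for w in range(w0, w0 + pw)
                ])
    return tokens


def make_latent(T: int, H: int, W: int) -> Latent:
    """Toy latent whose values encode their own position, for easy inspection."""
    return [[[float(t * H * W + h * W + w) for w in range(W)]
             for h in range(H)]
            for t in range(T)]


# A 4-frame "video" latent and a 1-frame "image" latent go through the same code path.
video_tokens = patchify(make_latent(4, 4, 6), pt=2, ph=2, pw=2)
image_tokens = patchify(make_latent(1, 4, 6), pt=1, ph=2, pw=2)
print(len(video_tokens), len(video_tokens[0]))
print(len(image_tokens), len(image_tokens[0]))
```

The sequence length simply follows from the input's duration and size, which is why one tokenizer can serve clips and still images of any shape.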

At inference time, we can control the size of the generated video by arranging randomly-initialized patches in an appropriately sized grid.
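The sketch below illustrates how the shape of the initial noise grid could determine the shape of the output. The `noise_grid` helper, the grid sizes and the per-patch dimension are hypothetical; the report gives no patch size or compression factor, so the numbers only show the principle.

```python
import random


def noise_grid(frames: int, height: int, width: int, patch_dim: int, seed: int = 0):
    """Build a frames x height x width grid of randomly initialized patches.

    Each grid cell holds one patch: a vector of patch_dim standard-normal samples.
    """
    rng = random.Random(seed)
    return [[[[rng.gauss(0.0, 1.0) for _ in range(patch_dim)]
              for _ in range(width)]
             for _ in range(height)]
            for _ in range(frames)]


def grid_shape(grid):
    """Return (frames, height, width) of a patch grid."""
    return (len(grid), len(grid[0]), len(grid[0][0]))


# Same procedure, different grid shapes (all sizes are illustrative assumptions).
landscape = noise_grid(frames=8, height=9, width=16, patch_dim=4)
portrait = noise_grid(frames=8, height=16, width=9, patch_dim=4)
still = noise_grid(frames=1, height=8, width=8, patch_dim=4)  # a single-frame grid

for name, grid in [("landscape", landscape), ("portrait", portrait), ("still", still)]:
    print(name, grid_shape(grid))
```

Nothing about the model would need to change between these cases; only the arrangement of the starting noise differs.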

Scaling transformers for video generation

Sora is a diffusion model: given input noisy patches (and conditioning information like text prompts), it is trained to predict the original “clean” patches.

❝

As with other diffusion models, generation starts from a noisy image and ends with a clean one.
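A toy version of this training objective is sketched below. The report only says the model is trained to predict the original clean patches from noisy ones; the variance-preserving mixing with `alpha_bar` and the mean-squared-error loss are standard DDPM-style conventions assumed here, and `identity_model` is a placeholder rather than a real network.

```python
import math
import random


def add_noise(clean, alpha_bar, rng):
    """Mix a clean patch with Gaussian noise.

    alpha_bar = 1.0 leaves the patch untouched; values near 0 give almost pure noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    return [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
            for x, e in zip(clean, noise)]


def mse(pred, target):
    """Mean squared error between two equally long vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def training_loss(model, clean, alpha_bar, text, rng):
    """Noise the clean patch, ask the model for a clean estimate, score it."""
    noisy = add_noise(clean, alpha_bar, rng)
    predicted_clean = model(noisy, alpha_bar, text)
    return mse(predicted_clean, clean)


def identity_model(noisy, alpha_bar, text):
    """Placeholder 'model' that returns its input unchanged (no learning)."""
    return noisy


rng = random.Random(0)
clean_patch = [0.5, -1.0, 0.25, 2.0]
loss_low_noise = training_loss(identity_model, clean_patch, 1.0, "a cat", rng)
loss_high_noise = training_loss(identity_model, clean_patch, 0.01, "a cat", rng)
print(loss_low_noise, loss_high_noise)
```

A trained network would replace `identity_model` and drive the loss down at every noise level, not only when no noise was added.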

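Importantly, Sora is a diffusion transformer. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, computer vision and image generation.

In this work, we find that diffusion transformers scale effectively as video models as well. The report also shows a comparison of video samples with fixed seeds and inputs as training progresses: sample quality improves markedly as training compute increases.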

[Video samples at base compute, 4x compute and 32x compute]

Variable durations, resolutions, aspect ratios

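Past approaches to image and video generation typically resize, crop or trim videos to a standard size, for example 4-second videos at 256x256 resolution. We find that training on data at its native size instead provides several benefits.

Sampling flexibility

Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything in between. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution, all with the same model.

Improved framing and composition

We find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of the model that crops all training videos to be square, which is common practice when training generative models.

The model trained on square crops (left in the report's figure) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.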

Language understanding

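Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 to videos: we first train a highly descriptive captioner model and then use it to produce text captions for all videos in the training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.

Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high-quality videos that accurately follow user prompts.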
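The two-stage flow described above, re-captioning the training set and expanding user prompts at inference time, can be sketched as plain function composition. Everything named here (`recaption_dataset`, `generate`, and the three toy stand-ins) is hypothetical; the real captioner, the GPT prompt and the video model's interface are not described in the report.

```python
from typing import Callable, Dict, List


def recaption_dataset(video_ids: List[str],
                      captioner: Callable[[str], str]) -> Dict[str, str]:
    """Training side: attach a descriptive caption to every training video."""
    return {video_id: captioner(video_id) for video_id in video_ids}


def generate(prompt: str,
             expand_prompt: Callable[[str], str],
             video_model: Callable[[str], str]) -> str:
    """Inference side: expand the short prompt, then condition the video model on it."""
    detailed_caption = expand_prompt(prompt)
    return video_model(detailed_caption)


# Toy stand-ins so the sketch runs; real components would be learned models.
def toy_captioner(video_id: str) -> str:
    return f"A detailed description of the contents of {video_id}."


def toy_expander(prompt: str) -> str:
    return f"{prompt}, described at length with setting, subject and motion"


def toy_video_model(caption: str) -> str:
    return f"<video conditioned on: {caption}>"


captions = recaption_dataset(["clip_001", "clip_002"], toy_captioner)
result = generate("a corgi surfing", toy_expander, toy_video_model)
print(captions["clip_001"])
print(result)
```

The point of the design is that the video model sees the same style of long, descriptive caption during training and at inference, however terse the user's prompt is.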

Prompting with images and videos

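All of the results above and on the landing page are text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks: creating perfectly looping video, animating static images, extending videos forwards or backwards in time, and so on.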

Animating DALL·E images

Sora can generate videos given an image and a prompt as input. The report shows example videos generated from DALL·E 2 and DALL·E 3 images.

Extending generated videos

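Sora can also extend videos, either forward or backward in time. The report's examples are all extended backward in time starting from a segment of a generated video, so each video starts differently yet all of them lead to the same ending.

We can also use this method to extend a video both forward and backward to produce a seamless infinite loop.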

Video-to-video editing

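Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. Here we apply one of these methods, SDEdit, to Sora. This technique enables Sora to transform the styles and environments of input videos zero-shot.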
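The core idea of SDEdit is to add noise to the input up to an intermediate level and then run the usual denoising process from there under the new prompt, so coarse structure survives while style can change. The sketch below shows only that control flow. The linear noise level, the `strength` parameter and `toy_denoise_step` are assumptions for illustration; how Sora applies SDEdit internally is not disclosed.

```python
import math
import random


def sdedit(source, denoise_step, strength, num_steps, prompt, seed=0):
    """Partially noise `source`, then denoise it under `prompt`.

    strength in [0, 1] is the fraction of the noise schedule traversed:
    0 returns the source unchanged, 1 starts from pure noise.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    start = int(round(strength * num_steps))
    if start == 0:
        return list(source)
    rng = random.Random(seed)
    sigma = start / num_steps  # assumed linear noise level in (0, 1]
    x = [math.sqrt(1.0 - sigma ** 2) * v + sigma * rng.gauss(0.0, 1.0)
         for v in source]
    # Walk the schedule back down from the intermediate level to zero noise.
    for step in range(start, 0, -1):
        x = denoise_step(x, step / num_steps, prompt)
    return x


def toy_denoise_step(x, noise_level, prompt):
    """Placeholder denoiser: pulls each value halfway toward a prompt-dependent target."""
    target = 1.0 if "bright" in prompt else -1.0
    return [0.5 * v + 0.5 * target for v in x]


source_clip = [0.2, -0.4, 0.9]
unchanged = sdedit(source_clip, toy_denoise_step, strength=0.0, num_steps=10, prompt="bright")
restyled = sdedit(source_clip, toy_denoise_step, strength=1.0, num_steps=10, prompt="bright")
print(unchanged, restyled)
```

Lower `strength` values keep more of the source; higher values give the prompt more freedom.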

Connecting videos

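We can also use Sora to connect two different input videos, creating seamless transitions between videos with entirely different subjects and scene compositions.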

Image generation capabilities

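Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes, up to 2048x2048 resolution.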

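Emerging simulation capabilities

We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people and animals from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.; they are purely phenomena of scale.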

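3D consistency

Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.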

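Long-range coherence and object permanence

A significant challenge for video generation systems has been maintaining spatiotemporal consistency in long videos. We find that Sora is often, though not always, able to effectively model both short- and long-range dependencies.

For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.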

Interacting with the world

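Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.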

Simulating digital worlds

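Sora is also able to simulate artificial processes, one example being video games. Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities require no additional training data or parameter tuning: they can be elicited zero-shot by prompting Sora with captions mentioning “Minecraft.”

These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them.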

Discussion

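Sora has limitations as a simulator. For example, it does not accurately model the physics of some basic interactions, such as glass shattering.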

