With Open-Sora, our goal is to foster innovation, creativity, and inclusivity within the realm of content creation.
## 📰 News
- **[2025.03.17]** 🔥 We released **Open-Sora 2.0** (11B). Built on an MMDiT architecture and optimized for image-to-video generation, the model produces high-quality videos (t2v, i2v, t2i2v) at 256x256 and 768x768 resolution. An attempt to adapt a high-compression autoencoder is also presented. 😚 All training code is released! [[report]]()
- **[2025.02.20]** 🔥 We released **Open-Sora 1.3** (1B). With the upgraded VAE and Transformer architecture, the quality of our generated videos has been greatly improved 🚀. [[checkpoints]](#open-sora-13-model-weights) [[report]](/docs/report_04.md) [[demo]](https://huggingface.co/spaces/hpcai-tech/open-sora)
- **[2024.12.23]** The development cost of video generation models has been cut by 50%! Open-source solutions are now available with H200 GPU vouchers. [[blog]](https://company.hpc-ai.com/blog/the-development-cost-of-video-generation-models-has-saved-by-50-open-source-solutions-are-now-available-with-h200-gpu-vouchers) [[code]](https://github.com/hpcaitech/Open-Sora/blob/main/scripts/train.py) [[vouchers]](https://colossalai.org/zh-Hans/docs/get_started/bonus/)
- **[2024.06.17]** We released **Open-Sora 1.2**, which includes **3D-VAE**, **rectified flow**, and **score condition**. The video quality is greatly improved. [[checkpoints]](#open-sora-12-model-weights) [[report]](/docs/report_03.md) [[arxiv]](https://arxiv.org/abs/2412.20404)
More samples and corresponding prompts are available in our [Gallery](https://hpcaitech.github.io/Open-Sora/).
During training, we include a motion score in the text prompt. During inference, you can use the following command to generate videos with a motion score (the default is 4):
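The actual CLI invocation is not reproduced here; purely as an illustration, folding a motion score into a text prompt might look like this (hypothetical helper, not the repo's actual code):

```python
def with_motion_score(prompt: str, motion_score: int = 4) -> str:
    # Hypothetical helper: append the motion score to the text prompt,
    # mirroring how the score is provided alongside captions during training.
    return f"{prompt} motion score: {motion_score}."

print(with_motion_score("a drone shot over a rocky coastline"))
```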
We also provide a dynamic motion score evaluator. After setting your OpenAI API key, you can use the following command to evaluate the motion score of a video:
We leverage ChatGPT to refine prompts; you can use the following command to do so. This works for both text-to-video and image-to-video generation.
To make the results reproducible, you can set the random seed by:
Use `--num-sample k` to generate `k` samples for each prompt.
## Computational Efficiency
We test the computational efficiency of text-to-video generation on H100/H800 GPUs. For 256x256, we use ColossalAI's tensor parallelism; for 768x768, we use ColossalAI's sequence parallelism. All runs use 50 sampling steps. The results are presented in the format: $\color{blue}{\text{Total time (s)}}/\color{red}{\text{peak GPU memory (GB)}}$
# Step-by-step guide to training or fine-tuning your own model
## Installation
In addition to the installation steps on the main page, you need to install the following packages:
```bash
pip install git+https://github.com/hpcaitech/TensorNVMe.git # requires cmake, for checkpoint saving
pip install pandarallel # for parallel processing
```
## Prepare dataset
The dataset should be provided as a `csv` or `parquet` file. To illustrate the process, we will use the 45k-video [pexels dataset](https://huggingface.co/datasets/hpcai-tech/open-sora-pexels-45k) as an example. This dataset contains clipped, score-filtered, high-quality videos from [Pexels](https://www.pexels.com/).
First, download the dataset to your local machine:
```bash
mkdir -p datasets
cd datasets
# For Chinese users, export HF_ENDPOINT=https://hf-mirror.com to speed up the download
# One way to fetch the dataset (assumed invocation; requires huggingface_hub):
huggingface-cli download hpcai-tech/open-sora-pexels-45k --repo-type dataset --local-dir pexels_45k
cd ..  # the dataset now lives at Open-Sora/datasets/pexels_45k
```
There are three `csv` files provided:
- `pexels_45k.csv`: contains only the path and text columns, which need further processing before training.
- `pexels_45k_necessary.csv`: contains all the information necessary for training.
- `pexels_45k_score.csv`: contains score information for each video. The 45k videos were filtered based on these scores; see the tech report for more details.
If you want to use a custom dataset, at least the following columns are required:
> The process may take a while, depending on the number of videos in the dataset. It is necessary for training on arbitrary aspect ratios, resolutions, and numbers of frames.
## Training
The command format to launch training is as follows:
All configs are located in `configs/diffusion/train/`. The following rules apply:
- `_base_ = ["config_to_inherit"]`: inherit from another config via mmengine. Variables are overwritten by the new config; dictionaries are merged unless the `_delete_` key is present.
- Command-line arguments override the config file. For example, `--lr 1e-5` overrides `lr` in the config file, and `--dataset.data-path datasets/pexels_45k_necessary.csv` overrides the `data-path` value in the `dataset` dictionary.
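As a sketch of the merging rules above (a simplified stand-in for mmengine's actual logic, not the library's code):

```python
def merge_config(base: dict, override: dict) -> dict:
    # Simplified sketch of mmengine-style merging: nested dicts are merged
    # recursively unless the override dict carries a truthy "_delete_" key,
    # in which case it replaces the inherited dict wholesale.
    merged = dict(base)
    for key, value in override.items():
        if (
            isinstance(value, dict)
            and isinstance(merged.get(key), dict)
            and not value.get("_delete_")
        ):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"lr": 1e-4, "dataset": {"data_path": "a.csv", "num_workers": 4}}
override = {"lr": 1e-5, "dataset": {"data_path": "datasets/pexels_45k_necessary.csv"}}
merged = merge_config(base, override)
# lr is overwritten; dataset keeps num_workers from the base config
```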
The `bucket_config` controls the different training stages. It is a dictionary of dictionaries, where each tuple is `(sampling probability, batch size)`. For example:
```python
bucket_config = {
"256px": {
1: (1.0, 45), # for 256px images, use 100% of the data with batch size 45
33: (1.0, 12), # for 256px videos with no less than 33 frames, use 100% of the data with batch size 12
65: (1.0, 6), # for 256px videos with no less than 65 frames, use 100% of the data with batch size 6
97: (1.0, 4), # for 256px videos with no less than 97 frames, use 100% of the data with batch size 4
129: (1.0, 3), # for 256px videos with no less than 129 frames, use 100% of the data with batch size 3
},
"768px": {
1: (0.5, 13), # for 768px images, use 50% of the data with batch size 13
},
"1024px": {
1: (0.5, 7), # for 1024px images, use 50% of the data with batch size 7
},
}
```
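To make the semantics concrete, here is a small sketch (an assumed routing rule, not the repo's actual sampler) of how a sample could be assigned through `bucket_config`:

```python
import random

def assign_bucket(bucket_config, resolution, num_frames):
    # Sketch: pick the largest frame threshold not exceeding num_frames,
    # keep the sample with the configured sampling probability, and return
    # the batch size for that bucket (or None if the sample is dropped).
    frame_buckets = bucket_config[resolution]
    eligible = [t for t in frame_buckets if t <= num_frames]
    if not eligible:
        return None
    keep_prob, batch_size = frame_buckets[max(eligible)]
    return batch_size if random.random() < keep_prob else None
```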
We provide the following configs; the batch sizes were searched on H200 GPUs with 140GB memory:
- `image.py`: train on images only.
- `stage1.py`: train on videos with 256px resolution.
- `stage2.py`: train on videos with 768px resolution with sequence parallelism (default 4).
- `stage1_i2v.py`: train t2v and i2v with 256px resolution.
- `stage2_i2v.py`: train t2v and i2v with 768px resolution.
We also provide a demo config `demo.py` with small batch size for debugging.
To fine-tune from flux-dev, we provide a transformed flux-dev [checkpoint](https://huggingface.co/hpcai-tech/flux1-dev-fused-rope). Download it to `ckpts` and run:
More details are provided in the tech report. If an explanation of any of the following techniques is needed, feel free to open an issue:
- Tensor parallelism and sequence parallelism
- Zero 2
- Pin memory organization
- Garbage collection organization
- Data prefetching
- Communication bucket optimization
- Shardformer for T5
### Gradient Checkpointing
We support selective gradient checkpointing to save memory. The `grad_ckpt_setting` is a tuple: the first element is the number of dual-stream blocks to apply gradient checkpointing to, and the second is the number of single-stream blocks. A very large number applies gradient checkpointing to all layers.
```python
grad_ckpt_setting = (100, 100)  # large values checkpoint every dual and single block
model = dict(
grad_ckpt_setting=grad_ckpt_setting,
)
```
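A sketch of the selection rule this implies (assumed semantics, not the actual model code):

```python
def is_checkpointed(block_idx: int, is_dual: bool, grad_ckpt_setting: tuple) -> bool:
    # grad_ckpt_setting = (num_dual, num_single): checkpoint the first
    # num_dual dual-stream blocks and the first num_single single-stream
    # blocks; a large value like (100, 100) checkpoints every block.
    num_dual, num_single = grad_ckpt_setting
    return block_idx < (num_dual if is_dual else num_single)
```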
To further save memory, you can offload gradient checkpointing to CPU by:
```python
grad_ckpt_buffer_size = 25 * 1024**3  # 25 GB CPU buffer for offloaded activations
```
### Asynchronous Checkpoint Saving
With `--async-io True`, checkpoints are saved asynchronously with the support of ColossalAI, overlapping checkpoint I/O with training so that saving no longer blocks the run.
### Dataset
With a very large dataset, the `csv` or even `parquet` file may be too large to fit in memory. We provide a script to split the dataset into smaller chunks:
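The script itself is not reproduced here; a minimal stand-in that streams a large `csv` into fixed-size chunks could look like:

```python
import pandas as pd

def split_csv(csv_path: str, rows_per_chunk: int, out_prefix: str) -> int:
    # Stream the file so the full dataset never sits in memory at once,
    # writing each chunk to its own smaller csv. Returns the chunk count.
    num_chunks = 0
    for chunk in pd.read_csv(csv_path, chunksize=rows_per_chunk):
        chunk.to_csv(f"{out_prefix}_{num_chunks:05d}.csv", index=False)
        num_chunks += 1
    return num_chunks
```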
from colossalai.utils.safetensors import save as async_save
from colossalai.zero.low_level import LowLevelZeroOptimizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from tensornvme.async_file_io import AsyncFileWriter
from torch.optim import Optimizer
from torch.optim.lr_scheduler import _LRScheduler
def load_checkpoint(
device_map: torch.device | str = "cpu",
cai_model_name: str = "model",
strict: bool = False,
rename_keys: dict | None = None,  # rename keys in the checkpoint to support fine-tuning with a different model architecture; maps old_key_prefix to new_key_prefix
) -> nn.Module:
"""
Loads a checkpoint into a model from a path. Supports three types of checkpoints:
tuple[nn.Module, nn.Module, nn.Module, nn.Module, dict[str, nn.Module]]: The models. They are the diffusion model, the autoencoder model, the T5 model, the CLIP model, and the optional models.
"""
model_device = "cpu" if offload_model and cfg.get("img_flux", None) is not None else device
model = build_module(cfg.model, MODELS, device_map=model_device, torch_dtype=dtype).eval()
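For illustration, the prefix renaming that `rename_keys` describes could be applied like this (a sketch, not the repo's actual implementation):

```python
def apply_rename_keys(state_dict: dict, rename_keys: dict) -> dict:
    # Map old key prefixes to new ones so a checkpoint trained with one
    # architecture can be loaded into a model whose module names differ.
    renamed = {}
    for key, value in state_dict.items():
        for old_prefix, new_prefix in rename_keys.items():
            if key.startswith(old_prefix):
                key = new_prefix + key[len(old_prefix):]
                break
        renamed[key] = value
    return renamed
```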