We follow the same steps as the original finetrainers to prepare the RGBA dataset. You can also follow the instructions above to preprocess the dataset yourself.
Here are some detailed steps to prepare the dataset for Mochi-1 fine-tuning:
- Download our preprocessed Video RGBA dataset, which has undergone preprocessing operations such as color decontamination and background blur.
- Use `trim_and_crop_videos.py` to crop and trim the RGB and Alpha videos as needed.
- Use `embed.py` to encode the RGB videos into latent representations and embed the video captions into embeddings.
- Use `embed.py` to encode the Alpha videos into latent representations.
- Concatenate the RGB and Alpha latent representations along the frames dimension.
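The concatenation step above can be sketched as follows. This is a minimal illustration, not the project's own code. It assumes each latent is a plain tensor laid out as (channels, frames, height, width), so the frames dimension is `dim=1`, and that the RGB latent comes first. The function name `concat_rgb_alpha` and the dummy shapes are hypothetical; check the layout that `embed.py` actually writes before reusing it.

```python
import torch


def concat_rgb_alpha(rgb_latent, alpha_latent, frame_dim=1):
    """Join RGB and Alpha latents along the frames dimension (RGB first)."""
    if rgb_latent.shape != alpha_latent.shape:
        raise ValueError(
            f"shape mismatch: {tuple(rgb_latent.shape)} vs {tuple(alpha_latent.shape)}"
        )
    return torch.cat([rgb_latent, alpha_latent], dim=frame_dim)


# Dummy latents standing in for the outputs of embed.py (shapes are illustrative only).
rgb = torch.zeros(4, 3, 2, 2)
alpha = torch.ones(4, 3, 2, 2)
joined = concat_rgb_alpha(rgb, alpha)
print(tuple(joined.shape))  # the frames dimension doubles: 3 -> 6
```

If the saved latents use a different layout (for example, with a leading batch dimension), pass the matching `frame_dim` instead of relying on the default.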
Finally, the dataset should be in the following format:
<video_1_concatenated>.latent.pt
<video_1_captions>.embed.pt
<video_2_concatenated>.latent.pt
<video_2_captions>.embed.pt
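Before training, it can help to check that every latent file has a matching embedding file. The sketch below is one way to do that with the standard library only. It assumes that a sample's two files share the same name apart from the `.latent.pt` / `.embed.pt` suffix, which may not match your naming; the helper `find_unpaired` and the demo file names are hypothetical.

```python
from pathlib import Path
import tempfile


def find_unpaired(dataset_dir):
    """Return names of .latent.pt files that have no sibling .embed.pt file."""
    root = Path(dataset_dir)
    missing = []
    for latent in sorted(root.rglob("*.latent.pt")):
        # Assumed convention: same file name, different suffix.
        embed = latent.with_name(latent.name.replace(".latent.pt", ".embed.pt"))
        if not embed.exists():
            missing.append(latent.name)
    return missing


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("video_1.latent.pt", "video_1.embed.pt", "video_2.latent.pt"):
        (Path(tmp) / name).touch()
    unpaired = find_unpaired(tmp)

print(unpaired)  # video_2 has a latent but no embedding
```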
Now, we're ready to fine-tune. To launch, run:
bash train.sh
Note:
The `--num_frames` argument specifies the number of frames in the generated RGB video. During generation, the number of frames is doubled so that the RGB and Alpha videos are generated jointly. Our implementation handles this doubling automatically.
For an 80GB GPU, we support processing RGB videos with dimensions of 480 × 848 × 79 (Height × Width × Frames) at a batch size of 1 using bfloat16 precision for training. However, the training is relatively slow (over one minute per iteration) because the model processes a total of 79 × 2 frames as input.
We haven't tested this rigorously, but with validation disabled, this script should run in under 40 GB of GPU VRAM.
To generate the RGBA video, run:
python cli.py \
--lora_path /path/to/lora \
--prompt "..."
This command generates the RGB and Alpha videos simultaneously and saves them. The RGB video is saved in premultiplied form, so to blend it with any background image you can use the following formula, where `bgr` is the background and `com` is the composite:
com = rgb + (1 - alpha) * bgr
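As a minimal sketch, the formula can be applied per frame with NumPy. The example assumes all arrays are floats in [0, 1], that `rgb` is already premultiplied by alpha, and that `bgr` is the background image (not a BGR channel order). The function name `composite` is hypothetical and not part of this repository.

```python
import numpy as np


def composite(rgb_premult, alpha, background):
    """Blend a premultiplied RGB frame over a background: com = rgb + (1 - alpha) * bgr."""
    rgb_premult = np.asarray(rgb_premult, dtype=np.float32)
    alpha = np.asarray(alpha, dtype=np.float32)
    background = np.asarray(background, dtype=np.float32)
    if alpha.ndim == rgb_premult.ndim - 1:
        alpha = alpha[..., None]  # add a channel axis so alpha broadcasts over RGB
    return np.clip(rgb_premult + (1.0 - alpha) * background, 0.0, 1.0)


# Tiny 1x2 frame: left pixel is fully opaque red, right pixel is fully transparent.
rgb = np.array([[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]])
alpha = np.array([[1.0, 0.0]])
bgr = np.full((1, 2, 3), 0.5)  # mid-gray background
com = composite(rgb, alpha, bgr)
print(com)
```

Because the RGB video is premultiplied, it is not multiplied by alpha again; only the background is weighted by `(1 - alpha)`.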
(Contributions are welcome 🤗)
Our script currently doesn't leverage `accelerate`, and some of the consequences are detailed below:
- No support for distributed training.
- `train_batch_size > 1` is supported but can potentially lead to OOMs because we currently don't have gradient accumulation support.
- No support for 8bit optimizers (but should be relatively easy to add).
Misc:
- We're aware of the quality issues in the `diffusers` implementation of Mochi-1. This is being fixed in this PR.
- The `embed.py` script is non-batched.