Enhance EasyAnimate with Reward Backpropagation (Preference Optimization)

We explore the Reward Backpropagation technique [1, 2] to optimize the videos generated by EasyAnimateV5 for better alignment with human preferences. We provide pre-trained models (i.e., LoRAs) along with the training script. You can use these LoRAs as plug-ins to enhance the corresponding base model, or train your own reward LoRA.

Demo

EasyAnimateV5-12b-zh-InP

Prompt	EasyAnimateV5-12b-zh-InP	EasyAnimateV5-12b-zh-InP HPSv2.1 Reward LoRA	EasyAnimateV5-12b-zh-InP MPS Reward LoRA
Porcelain rabbit hopping by a golden cactus	00000007.mp4	00000007.mp4	00000007.mp4
Yellow rubber duck floating next to a blue bath towel	00000005.mp4	00000005.mp4	00000005.mp4
An elephant sprays water with its trunk, a lion sitting nearby	00000003.mp4	00000003.mp4	00000003.mp4
A fish swims gracefully in a tank as a horse gallops outside	00000002.mp4	00000002.mp4	00000002.mp4

EasyAnimateV5-7b-zh-InP

Prompt	EasyAnimateV5-7b-zh-InP	EasyAnimateV5-7b-zh-InP HPSv2.1 Reward LoRA	EasyAnimateV5-7b-zh-InP MPS Reward LoRA
Crystal cake shimmering beside a metal apple	00000006.mp4	00000006.mp4	00000006.mp4
Elderly artist with a white beard painting on a white canvas	00000005.mp4	00000005.mp4	00000005.mp4
Porcelain rabbit hopping by a golden cactus	00000007.mp4	00000007.mp4	00000007.mp4
Green parrot perching on a brown chair	00000004.mp4	00000004.mp4	00000004.mp4

Note

The test prompts above are from T2V-CompBench. All videos are generated with a LoRA weight of 0.7.

Model Zoo

Name	Base Model	Reward Model	Hugging Face	Description
EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors	EasyAnimateV5-12b-zh-InP	HPS v2.1	🤗Link	Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-12b-zh-InP. It is trained with a batch size of 8 for 2,500 steps.
EasyAnimateV5-7b-zh-InP-HPS2.1.safetensors	EasyAnimateV5-7b-zh-InP	HPS v2.1	🤗Link	Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-7b-zh-InP. It is trained with a batch size of 8 for 3,500 steps.
EasyAnimateV5-12b-zh-InP-MPS.safetensors	EasyAnimateV5-12b-zh-InP	MPS	🤗Link	Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-12b-zh-InP. It is trained with a batch size of 8 for 2,500 steps.
EasyAnimateV5-7b-zh-InP-MPS.safetensors	EasyAnimateV5-7b-zh-InP	MPS	🤗Link	Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-7b-zh-InP. It is trained with a batch size of 8 for 2,000 steps.

Inference

We provide example inference code to run EasyAnimateV5-12b-zh-InP with its HPS v2.1 reward LoRA.

import torch
from diffusers import DDIMScheduler
from omegaconf import OmegaConf
from transformers import BertModel, BertTokenizer, T5EncoderModel, T5Tokenizer

from easyanimate.models import AutoencoderKLMagvit, EasyAnimateTransformer3DModel
from easyanimate.pipeline.pipeline_easyanimate_multi_text_encoder_inpaint import EasyAnimatePipeline_Multi_Text_Encoder_Inpaint
from easyanimate.utils.lora_utils import merge_lora
from easyanimate.utils.utils import get_image_to_video_latent, save_videos_grid
from easyanimate.utils.fp8_optimization import convert_weight_dtype_wrapper

# GPU memory mode, which can be chosen from [model_cpu_offload, model_cpu_offload_and_qfloat8, sequential_cpu_offload].
GPU_memory_mode = "model_cpu_offload"
# Download from https://raw.githubusercontent.com/aigc-apps/EasyAnimate/refs/heads/main/config/easyanimate_video_v5_magvit_multi_text_encoder.yaml
config_path = "config/easyanimate_video_v5_magvit_multi_text_encoder.yaml"
model_path = "alibaba-pai/EasyAnimateV5-12b-zh-InP"
lora_path = "alibaba-pai/EasyAnimateV5-Reward-LoRAs/EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors"
weight_dtype = torch.bfloat16
lora_weight = 0.7

prompt = "A panda eats bamboo while a monkey swings from branch to branch"
sample_size = [512, 512]
video_length = 49

config = OmegaConf.load(config_path)
transformer_additional_kwargs = OmegaConf.to_container(config['transformer_additional_kwargs'])
if weight_dtype == torch.float16:
    transformer_additional_kwargs["upcast_attention"] = True
transformer = EasyAnimateTransformer3DModel.from_pretrained_2d(
    model_path, 
    subfolder="transformer",
    transformer_additional_kwargs=transformer_additional_kwargs,
    torch_dtype=torch.float8_e4m3fn if GPU_memory_mode == "model_cpu_offload_and_qfloat8" else weight_dtype,
    low_cpu_mem_usage=True,
)
vae = AutoencoderKLMagvit.from_pretrained(
    model_path, subfolder="vae", vae_additional_kwargs=OmegaConf.to_container(config['vae_kwargs'])
).to(weight_dtype)
if config['vae_kwargs'].get('vae_type', 'AutoencoderKL') == 'AutoencoderKLMagvit' and weight_dtype == torch.float16:
    vae.upcast_vae = True

pipeline = EasyAnimatePipeline_Multi_Text_Encoder_Inpaint.from_pretrained(
    model_path,
    text_encoder=BertModel.from_pretrained(model_path, subfolder="text_encoder").to(weight_dtype),
    text_encoder_2=T5EncoderModel.from_pretrained(model_path, subfolder="text_encoder_2").to(weight_dtype),
    tokenizer=BertTokenizer.from_pretrained(model_path, subfolder="tokenizer"),
    tokenizer_2=T5Tokenizer.from_pretrained(model_path, subfolder="tokenizer_2"),
    vae=vae,
    transformer=transformer,
    scheduler=DDIMScheduler.from_pretrained(model_path, subfolder="scheduler"),
    torch_dtype=weight_dtype
)
if GPU_memory_mode == "sequential_cpu_offload":
    pipeline.enable_sequential_cpu_offload()
elif GPU_memory_mode == "model_cpu_offload_and_qfloat8":
    pipeline.enable_model_cpu_offload()
    convert_weight_dtype_wrapper(pipeline.transformer, weight_dtype)
else:
    pipeline.enable_model_cpu_offload()
pipeline = merge_lora(pipeline, lora_path, lora_weight)

generator = torch.Generator(device="cuda").manual_seed(42)
input_video, input_video_mask, _ = get_image_to_video_latent(None, None, video_length=video_length, sample_size=sample_size)
sample = pipeline(
    prompt, 
    video_length = video_length,
    negative_prompt = "bad detailed",
    height = sample_size[0],
    width = sample_size[1],
    generator = generator,
    guidance_scale = 7.0,
    num_inference_steps = 50,
    video = input_video,
    mask_video = input_video_mask,
).videos

save_videos_grid(sample, "samples/output.mp4", fps=8)
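
The `lora_path` above has the form `<owner>/<repo>/<file>`. If `merge_lora` expects a local file rather than a Hub path (an assumption; check `easyanimate/utils/lora_utils.py`), the checkpoint can be downloaded first. The sketch below splits the path and fetches the file with `hf_hub_download` from `huggingface_hub`. The helper names `split_lora_path` and `fetch_lora` are illustrative and not part of EasyAnimate.

```python
from typing import Tuple


def split_lora_path(lora_path: str) -> Tuple[str, str]:
    """Split "<owner>/<repo>/<file>" into (repo_id, filename)."""
    owner, repo, filename = lora_path.split("/", 2)
    return f"{owner}/{repo}", filename


def fetch_lora(lora_path: str) -> str:
    """Download the LoRA file from the Hugging Face Hub and return its local path."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    repo_id, filename = split_lora_path(lora_path)
    return hf_hub_download(repo_id=repo_id, filename=filename)


repo_id, filename = split_lora_path(
    "alibaba-pai/EasyAnimateV5-Reward-LoRAs/EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors"
)
print(repo_id)   # alibaba-pai/EasyAnimateV5-Reward-LoRAs
print(filename)  # EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors
```

The string returned by `fetch_lora(lora_path)` could then be passed to `merge_lora` in place of `lora_path`.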

Training

The training code is based on train_lora.py. We provide a shell script to train the HPS v2.1 reward LoRA for EasyAnimateV5-12b-zh-InP.

Setup

Please read the quick-start section to set up the EasyAnimate environment. If you use the HPS reward model, run the following script to install its dependencies:

# For HPS reward model only
pip install hpsv2
site_packages=$(python -c "import site; print(site.getsitepackages()[0])")
wget -O "$site_packages/hpsv2/src/open_clip/factory.py" https://pai-aigc-photog.oss-cn-hangzhou.aliyuncs.com/easyanimate/package/patches/hpsv2_src_open_clip_factory_patches.py
wget -O "$site_packages/hpsv2/src/open_clip/bpe_simple_vocab_16e6.txt.gz" https://github.com/tgxs002/HPSv2/raw/refs/heads/master/hpsv2/src/open_clip/bpe_simple_vocab_16e6.txt.gz

Note

Some models are downloaded automatically from Hugging Face. If you cannot access huggingface.co, run HF_ENDPOINT=https://hf-mirror.com sh scripts/train_reward_lora.sh instead.
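
The same mirror can be selected from Python, for example at the top of an inference script. This is a minimal sketch, assuming the mirror URL from the note and that `huggingface_hub` reads `HF_ENDPOINT` when it is first imported, so the variable has to be set before `transformers` or `diffusers` are imported.

```python
import os

# Set the endpoint before any Hugging Face library is imported.
# setdefault keeps a value that was already exported in the shell.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

# Imports of transformers / diffusers / easyanimate would follow here.
```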

Important Args

rank: The rank of the LoRA, which determines its size. A higher rank means more trainable parameters and more capacity to learn (including some unnecessary information). By default, we set the rank to 128. You can lower this value to reduce training GPU memory usage and the LoRA file size.
network_alpha: A scaling factor that controls how strongly the LoRA affects the base model weights. In general, it can be set to half of the rank.
prompt_path: The path to the prompt file (in txt format, each line is a prompt) for sampling training videos. We randomly selected 701 prompts from MovieGenBench.
train_sample_height and train_sample_width: The resolution of the sampled training videos. We found training at a 256x256 resolution can generalize to any other resolution. Reducing the resolution can save GPU memory during training, but it is recommended that the resolution should be equal to or greater than the image input resolution of the reward model. Due to the resize and crop preprocessing operations, we suggest using a 1:1 aspect ratio.
reward_fn and reward_fn_kwargs: The reward model name and its keyword arguments. All supported reward models (Aesthetic Predictor v2/v2.5, HPS v2/v2.1, PickScore and MPS) can be found in reward_fn.py. You can also customize your own reward model (e.g., combining aesthetic predictor with HPS).
num_decoded_latents and num_sampled_frames: The number of decoded latents (for the VAE) and sampled frames (for the reward model). Since EasyAnimate adopts a 3D causal VAE, we found that decoding only the first latent to obtain the first frame for computing the reward not only reduces training memory usage but also prevents excessive reward optimization and maintains the dynamics of the generated videos.
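
To make the roles of `rank` and `network_alpha` concrete, the following sketch uses the standard LoRA formulation, in which a frozen weight `W` receives the update `(network_alpha / rank) * up @ down`, further multiplied by the LoRA weight chosen at inference. It is plain Python for illustration only. The function names and the 3072-wide layer are hypothetical, and whether the training script and `merge_lora` apply exactly this scaling is an assumption.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA pair: down (rank x in) plus up (out x rank)."""
    return rank * (in_features + out_features)


def lora_scale(network_alpha, rank, lora_weight=1.0):
    """Overall factor applied to the low-rank update up @ down."""
    return lora_weight * network_alpha / rank


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_weight(base, up, down, network_alpha, lora_weight=1.0):
    """Return base + scale * (up @ down) for a single linear layer."""
    rank = len(down)  # down has shape (rank, in_features)
    scale = lora_scale(network_alpha, rank, lora_weight)
    delta = matmul(up, down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]


# Hypothetical 3072 x 3072 projection with the default rank of 128.
print(lora_param_count(3072, 3072, 128))  # 786432
# Defaults from this document: network_alpha=64, rank=128, LoRA weight 0.7.
print(lora_scale(64, 128, 0.7))  # 0.35

# Tiny rank-1 example on a 2 x 2 identity weight.
merged = merge_weight(
    base=[[1.0, 0.0], [0.0, 1.0]],
    up=[[1.0], [2.0]],
    down=[[3.0, 4.0]],
    network_alpha=0.5,
)
print(merged)
```

Under this formulation, the defaults (`rank=128`, `network_alpha=64`) scale the update by 0.5, and a LoRA weight of 0.7 at inference gives an overall factor of 0.35. Halving the rank halves the number of LoRA parameters per layer.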

Limitations

We observe that after training to a certain extent, the reward continues to increase, but the quality of the generated videos does not improve further. The model learns shortcuts (adding artifacts in the background, i.e., adversarial patches) to increase the reward.
Suitable preference models for video generation are still lacking. Image preference models cannot evaluate preferences along the temporal dimension (such as dynamism and consistency). Furthermore, we find that using image preference models decreases the dynamism of generated videos. Although this can be mitigated by computing the reward using only the first frame of the decoded video, the impact still persists.

References

[1] Clark, Kevin, et al. "Directly fine-tuning diffusion models on differentiable rewards." In ICLR 2024.
[2] Prabhudesai, Mihir, et al. "Aligning text-to-image diffusion models with reward backpropagation." arXiv preprint arXiv:2310.03739 (2023).
