Overview
Kandinsky 5.0 uses a latent diffusion pipeline with Flow Matching and features:
- Diffusion Transformer (DiT): Main generative backbone with cross-attention to text embeddings
- Qwen2.5-VL and CLIP: Provide high-quality text embeddings
- HunyuanVideo 3D VAE: Encodes and decodes video into a latent space
Model variants
- SFT model: Highest generation quality
- CFG-distilled: 2× faster inference
- Diffusion-distilled: 6× faster with minimal quality loss (16 steps)
- Pretrain model: Designed for fine-tuning
| Model | Video Duration | NFE | Latency (H100) |
|---|---|---|---|
| Kandinsky 5.0 T2V Lite SFT | 5s / 10s | 100 | 139s / 224s |
| Kandinsky 5.0 T2V Lite no-CFG | 5s / 10s | 50 | 77s / 124s |
| Kandinsky 5.0 T2V Lite distill | 5s / 10s | 16 | 35s / 61s |
| Kandinsky 5.0 I2V Lite | 5s | 100 | 673s |
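The speedups implied by the table can be checked with a little arithmetic. The sketch below copies the T2V Lite rows into a dictionary and compares each distilled variant against the SFT model, both by number of function evaluations (NFE) and by measured latency. The dictionary layout and function names are illustrative only. On these numbers the NFE ratios are 2× and 6.25×, while the wall-clock latency ratios come out lower, at roughly 1.8× and 3.7× to 4.0×; reading the headline "2×" and "6×" figures as NFE-based is an assumption, not something the source states.

```python
# Figures copied from the T2V Lite rows of the table (latency in seconds on an H100).
# The dictionary layout and helper names are illustrative, not part of any Kandinsky API.
T2V_LITE = {
    "SFT": {"nfe": 100, "5s": 139, "10s": 224},
    "no-CFG": {"nfe": 50, "5s": 77, "10s": 124},
    "distill": {"nfe": 16, "5s": 35, "10s": 61},
}


def nfe_ratio(variant: str, baseline: str = "SFT") -> float:
    """How many times fewer function evaluations `variant` needs than `baseline`."""
    return T2V_LITE[baseline]["nfe"] / T2V_LITE[variant]["nfe"]


def latency_speedup(variant: str, duration: str, baseline: str = "SFT") -> float:
    """Measured wall-clock speedup of `variant` over `baseline` for one clip length."""
    return T2V_LITE[baseline][duration] / T2V_LITE[variant][duration]


for name in ("no-CFG", "distill"):
    print(
        f"{name}: NFE ratio {nfe_ratio(name):.2f}x, "
        f"latency speedup {latency_speedup(name, '5s'):.2f}x (5s), "
        f"{latency_speedup(name, '10s'):.2f}x (10s)"
    )
```

The gap between the two ratios is expected whenever part of the runtime (text encoding, VAE decoding, model loading) does not scale with the number of diffusion steps.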
Text-to-Video workflow
1. Download workflow file
Update ComfyUI to the latest version, then open Workflow -> Browse Templates -> Video from the menu and select “Kandinsky 5.0 T2V” to load the workflow.
Download Workflow
Download the JSON workflow file for local use
2. Manually download models
- Text Encoders
- Diffusion Model
- VAE
Image-to-Video workflow
1. Download workflow file
Update ComfyUI to the latest version, then open Workflow -> Browse Templates -> Video from the menu and select “Kandinsky 5.0 I2V” to load the workflow.
Download Workflow
Download the JSON workflow file for local use
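Before loading a downloaded workflow file, it can help to see which node types it uses, for example to spot custom nodes or loaders that expect models you have not downloaded yet. The sketch below is a minimal inspection helper, assuming the file follows one of the two common ComfyUI JSON layouts: the UI export with a top-level "nodes" list whose entries carry a "type" field, or the API export that maps node ids to objects with a "class_type" field. The function names, the sample workflow, and its node names are hypothetical and only illustrate the idea.

```python
import json
from collections import Counter
from pathlib import Path


def load_workflow(path: str) -> dict:
    """Parse a workflow JSON file from disk (the path is supplied by the caller)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def node_types(workflow: dict) -> Counter:
    """Count node types in a parsed workflow.

    Assumes either the UI export layout ({"nodes": [{"type": ...}, ...]})
    or the API export layout ({"<id>": {"class_type": ...}, ...}).
    """
    nodes = workflow.get("nodes")
    if isinstance(nodes, list):
        return Counter(node.get("type", "<unknown>") for node in nodes)
    return Counter(
        entry["class_type"]
        for entry in workflow.values()
        if isinstance(entry, dict) and "class_type" in entry
    )


# Tiny made-up workflow in the UI export layout, used only to demonstrate the helper.
sample = {
    "nodes": [
        {"id": 1, "type": "ExampleTextEncoderLoader"},
        {"id": 2, "type": "ExampleTextEncoderLoader"},
        {"id": 3, "type": "ExampleVAELoader"},
    ]
}

for node_type, count in sorted(node_types(sample).items()):
    print(f"{count:3d}  {node_type}")
```

For a real file, pass the result of `load_workflow("path/to/your_workflow.json")` to `node_types` instead of the sample dictionary.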