SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules
| Published in: | arXiv.org (Dec 6, 2024), p. n/a |
|---|---|
| Lead author: | Li, Suyi |
| Other authors: | Yang, Lingyun; Jiang, Xiaoxiao; Lu, Hanfeng; An, Dakai; Zhipeng Di; Lu, Weiyi; Chen, Jiawei; Liu, Kan; Yu, Yinghao; Tao, Lan; Yang, Guodong; Qu, Lin; Zhang, Liping; Wang, Wei |
| Published: | Cornell University Library, arXiv.org |
| Subjects: | Parallel processing; Modules; Image quality; Image processing; Workflow; Effectiveness |
| Online access: | Citation/Abstract; Full text outside of ProQuest |
MARC
| LEADER | 00000nab a2200000uu 4500 | ||
|---|---|---|---|
| 001 | 3075441898 | ||
| 003 | UK-CbPIL | ||
| 022 | |a 2331-8422 | ||
| 035 | |a 3075441898 | ||
| 045 | 0 | |b d20241206 | |
| 100 | 1 | |a Li, Suyi | |
| 245 | 1 | |a SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules | |
| 260 | |b Cornell University Library, arXiv.org |c Dec 6, 2024 | ||
| 513 | |a Working Paper | ||
| 520 | 3 | |a Text-to-image (T2I) generation using diffusion models has become a blockbuster service in today's AI cloud. A production T2I service typically involves a serving workflow where a base diffusion model is augmented with various "add-on" modules, notably ControlNet and LoRA, to enhance image generation control. Compared to serving the base model alone, these add-on modules introduce significant loading and computational overhead, resulting in increased latency. In this paper, we present SwiftDiffusion, a system that efficiently serves a T2I workflow through a holistic approach. SwiftDiffusion decouples ControlNet from the base model and deploys it as a separate, independently scaled service on dedicated GPUs, enabling ControlNet caching, parallelization, and sharing. To mitigate the high loading overhead of LoRA serving, SwiftDiffusion employs a bounded asynchronous LoRA loading (BAL) technique, allowing LoRA loading to overlap with the initial base model execution by up to k steps without compromising image quality. Furthermore, SwiftDiffusion optimizes base model execution with a novel latent parallelism technique. Collectively, these designs enable SwiftDiffusion to outperform the state-of-the-art T2I serving systems, achieving up to 7.8x latency reduction and 1.6x throughput improvement in serving SDXL models on H800 GPUs, without sacrificing image quality. | |
| 653 | |a Parallel processing | ||
| 653 | |a Modules | ||
| 653 | |a Image quality | ||
| 653 | |a Image processing | ||
| 653 | |a Workflow | ||
| 653 | |a Effectiveness | ||
| 700 | 1 | |a Yang, Lingyun | |
| 700 | 1 | |a Jiang, Xiaoxiao | |
| 700 | 1 | |a Lu, Hanfeng | |
| 700 | 1 | |a An, Dakai | |
| 700 | 1 | |a Zhipeng Di | |
| 700 | 1 | |a Lu, Weiyi | |
| 700 | 1 | |a Chen, Jiawei | |
| 700 | 1 | |a Liu, Kan | |
| 700 | 1 | |a Yu, Yinghao | |
| 700 | 1 | |a Tao, Lan | |
| 700 | 1 | |a Yang, Guodong | |
| 700 | 1 | |a Qu, Lin | |
| 700 | 1 | |a Zhang, Liping | |
| 700 | 1 | |a Wang, Wei | |
| 773 | 0 | |t arXiv.org |g (Dec 6, 2024), p. n/a | |
| 786 | 0 | |d ProQuest |t Engineering Database | |
| 856 | 4 | 1 | |3 Citation/Abstract |u https://www.proquest.com/docview/3075441898/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch |
| 856 | 4 | 0 | |3 Full text outside of ProQuest |u http://arxiv.org/abs/2407.02031 |
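The bounded asynchronous LoRA loading (BAL) technique described in the abstract can be illustrated with a minimal sketch: LoRA weight loading runs on a background thread while the base model executes its first denoising steps, and denoising blocks for the weights only once the bound of k steps is reached. This is an illustrative sketch of the idea only, not SwiftDiffusion's implementation; the function names (`bounded_async_lora_serve`, `denoise_step`, `load_lora`) and the threading-based design are assumptions.

```python
import threading


def bounded_async_lora_serve(denoise_step, load_lora, total_steps=50, k=4):
    """Sketch of bounded asynchronous LoRA loading (BAL).

    The first up-to-k denoising steps may run with the base model alone
    while LoRA weights load in the background; from step k onward the
    loop waits for the load to finish, so at most k steps execute
    without the LoRA applied.
    """
    loaded = threading.Event()

    def loader():
        load_lora()          # e.g. fetch LoRA weights from host memory
        loaded.set()         # signal that the weights are ready

    threading.Thread(target=loader, daemon=True).start()

    for step in range(total_steps):
        if step >= k:
            loaded.wait()    # bound reached: block until LoRA is ready
        denoise_step(step, use_lora=loaded.is_set())
```

The bound k caps how many early steps can diverge from the fully LoRA-augmented trajectory, which is how the abstract claims overlap is achieved "without compromising image quality."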