SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules

Bibliographic Details
Published in: arXiv.org (Dec 6, 2024), p. n/a
Main Author: Li, Suyi
Other Authors: Yang, Lingyun, Jiang, Xiaoxiao, Lu, Hanfeng, An, Dakai, Zhipeng Di, Lu, Weiyi, Chen, Jiawei, Liu, Kan, Yu, Yinghao, Tao, Lan, Yang, Guodong, Qu, Lin, Zhang, Liping, Wang, Wei
Published:
Cornell University Library, arXiv.org
Subjects:
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3075441898
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3075441898 
045 0 |b d20241206 
100 1 |a Li, Suyi 
245 1 |a SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules 
260 |b Cornell University Library, arXiv.org  |c Dec 6, 2024 
513 |a Working Paper 
520 3 |a Text-to-image (T2I) generation using diffusion models has become a blockbuster service in today's AI cloud. A production T2I service typically involves a serving workflow where a base diffusion model is augmented with various "add-on" modules, notably ControlNet and LoRA, to enhance image generation control. Compared to serving the base model alone, these add-on modules introduce significant loading and computational overhead, resulting in increased latency. In this paper, we present SwiftDiffusion, a system that efficiently serves a T2I workflow through a holistic approach. SwiftDiffusion decouples ControlNet from the base model and deploys it as a separate, independently scaled service on dedicated GPUs, enabling ControlNet caching, parallelization, and sharing. To mitigate the high loading overhead of LoRA serving, SwiftDiffusion employs a bounded asynchronous LoRA loading (BAL) technique, allowing LoRA loading to overlap with the initial base model execution by up to k steps without compromising image quality. Furthermore, SwiftDiffusion optimizes base model execution with a novel latent parallelism technique. Collectively, these designs enable SwiftDiffusion to outperform the state-of-the-art T2I serving systems, achieving up to 7.8x latency reduction and 1.6x throughput improvement in serving SDXL models on H800 GPUs, without sacrificing image quality. 
653 |a Parallel processing 
653 |a Modules 
653 |a Image quality 
653 |a Image processing 
653 |a Workflow 
653 |a Effectiveness 
700 1 |a Yang, Lingyun 
700 1 |a Jiang, Xiaoxiao 
700 1 |a Lu, Hanfeng 
700 1 |a An, Dakai 
700 1 |a Zhipeng Di 
700 1 |a Lu, Weiyi 
700 1 |a Chen, Jiawei 
700 1 |a Liu, Kan 
700 1 |a Yu, Yinghao 
700 1 |a Tao, Lan 
700 1 |a Yang, Guodong 
700 1 |a Qu, Lin 
700 1 |a Zhang, Liping 
700 1 |a Wang, Wei 
773 0 |t arXiv.org  |g (Dec 6, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3075441898/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2407.02031
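
Note on the abstract's bounded asynchronous LoRA loading (BAL): the record's 520 field describes overlapping LoRA weight loading with up to k initial denoising steps of the base model. The Python sketch below is only a conceptual illustration of that overlap-with-a-bound idea, not the authors' implementation; the function names (load_lora_weights, denoise_step), the constants K_BOUND and NUM_STEPS, and the threading-based loader are all hypothetical placeholders assumed for illustration.

import threading

K_BOUND = 4          # assumed bound on how many steps may run before LoRA must be ready
NUM_STEPS = 30       # assumed number of denoising steps

def load_lora_weights(path):
    """Placeholder: read LoRA adapter weights from storage (hypothetical)."""
    return {"path": path}

def denoise_step(latents, step, lora=None):
    """Placeholder: one diffusion denoising step, optionally LoRA-augmented (hypothetical)."""
    return latents

def generate(prompt_latents, lora_path):
    lora_box = {}
    done = threading.Event()

    def loader():
        # Load LoRA weights in the background while early steps execute.
        lora_box["weights"] = load_lora_weights(lora_path)
        done.set()

    threading.Thread(target=loader, daemon=True).start()

    latents = prompt_latents
    for step in range(NUM_STEPS):
        if step >= K_BOUND:
            done.wait()  # bound reached: block until the LoRA weights are available
        lora = lora_box.get("weights") if done.is_set() else None
        latents = denoise_step(latents, step, lora=lora)
    return latents

In this reading of the abstract, the first K_BOUND steps may proceed without the LoRA applied, which is why the paper frames the overlap as bounded so that image quality is not compromised.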