SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules

Bibliographic Details
Published in: arXiv.org (Dec 6, 2024), p. n/a
Main Author: Li, Suyi
Other Authors: Yang, Lingyun, Jiang, Xiaoxiao, Lu, Hanfeng, An, Dakai, Zhipeng Di, Lu, Weiyi, Chen, Jiawei, Liu, Kan, Yu, Yinghao, Tao, Lan, Yang, Guodong, Qu, Lin, Zhang, Liping, Wang, Wei
Published:
Cornell University Library, arXiv.org
Subjects:
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3075441898
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3075441898 
045 0 |b d20241206 
100 1 |a Li, Suyi 
245 1 |a SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules 
260 |b Cornell University Library, arXiv.org  |c Dec 6, 2024 
513 |a Working Paper 
520 3 |a Text-to-image (T2I) generation using diffusion models has become a blockbuster service in today's AI cloud. A production T2I service typically involves a serving workflow where a base diffusion model is augmented with various "add-on" modules, notably ControlNet and LoRA, to enhance image generation control. Compared to serving the base model alone, these add-on modules introduce significant loading and computational overhead, resulting in increased latency. In this paper, we present SwiftDiffusion, a system that efficiently serves a T2I workflow through a holistic approach. SwiftDiffusion decouples ControlNet from the base model and deploys it as a separate, independently scaled service on dedicated GPUs, enabling ControlNet caching, parallelization, and sharing. To mitigate the high loading overhead of LoRA serving, SwiftDiffusion employs a bounded asynchronous LoRA loading (BAL) technique, allowing LoRA loading to overlap with the initial base model execution by up to k steps without compromising image quality. Furthermore, SwiftDiffusion optimizes base model execution with a novel latent parallelism technique. Collectively, these designs enable SwiftDiffusion to outperform the state-of-the-art T2I serving systems, achieving up to 7.8x latency reduction and 1.6x throughput improvement in serving SDXL models on H800 GPUs, without sacrificing image quality. 
653 |a Parallel processing 
653 |a Modules 
653 |a Image quality 
653 |a Image processing 
653 |a Workflow 
653 |a Effectiveness 
700 1 |a Yang, Lingyun 
700 1 |a Jiang, Xiaoxiao 
700 1 |a Lu, Hanfeng 
700 1 |a An, Dakai 
700 1 |a Zhipeng Di 
700 1 |a Lu, Weiyi 
700 1 |a Chen, Jiawei 
700 1 |a Liu, Kan 
700 1 |a Yu, Yinghao 
700 1 |a Tao, Lan 
700 1 |a Yang, Guodong 
700 1 |a Qu, Lin 
700 1 |a Zhang, Liping 
700 1 |a Wang, Wei 
773 0 |t arXiv.org  |g (Dec 6, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3075441898/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2407.02031
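
Note on the abstract's bounded asynchronous LoRA loading (BAL): the record's 520 field describes overlapping LoRA weight loading with up to k initial denoising steps of the base model. The Python sketch below is only a conceptual illustration of that overlap-with-a-bound idea, not the authors' implementation; the function names (load_lora_weights, denoise_step), the constants K_BOUND and NUM_STEPS, and the threading-based loader are all hypothetical placeholders assumed for illustration.

import threading

K_BOUND = 4          # assumed bound on how many steps may run before LoRA must be ready
NUM_STEPS = 30       # assumed number of denoising steps

def load_lora_weights(path):
    """Placeholder: read LoRA adapter weights from storage (hypothetical)."""
    return {"path": path}

def denoise_step(latents, step, lora=None):
    """Placeholder: one diffusion denoising step, optionally LoRA-augmented (hypothetical)."""
    return latents

def generate(prompt_latents, lora_path):
    lora_box = {}
    done = threading.Event()

    def loader():
        # Load LoRA weights in the background while early steps execute.
        lora_box["weights"] = load_lora_weights(lora_path)
        done.set()

    threading.Thread(target=loader, daemon=True).start()

    latents = prompt_latents
    for step in range(NUM_STEPS):
        if step >= K_BOUND:
            done.wait()  # bound reached: block until the LoRA weights are available
        lora = lora_box.get("weights") if done.is_set() else None
        latents = denoise_step(latents, step, lora=lora)
    return latents

In this reading of the abstract, the first K_BOUND steps may proceed without the LoRA applied, which is why the paper frames the overlap as bounded so that image quality is not compromised.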