SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules

Saved in:
Bibliographic Details
Published in: arXiv.org (Dec 6, 2024), p. n/a
Main Author: Li, Suyi
Other Authors: Yang, Lingyun; Jiang, Xiaoxiao; Lu, Hanfeng; An, Dakai; Zhipeng Di; Lu, Weiyi; Chen, Jiawei; Liu, Kan; Yu, Yinghao; Tao, Lan; Yang, Guodong; Qu, Lin; Zhang, Liping; Wang, Wei
Published: Cornell University Library, arXiv.org
Online Access: Citation/Abstract
Full text outside of ProQuest
Description
Abstract: Text-to-image (T2I) generation using diffusion models has become a blockbuster service in today's AI cloud. A production T2I service typically involves a serving workflow where a base diffusion model is augmented with various "add-on" modules, notably ControlNet and LoRA, to enhance image generation control. Compared to serving the base model alone, these add-on modules introduce significant loading and computational overhead, resulting in increased latency. In this paper, we present SwiftDiffusion, a system that efficiently serves a T2I workflow through a holistic approach. SwiftDiffusion decouples ControlNet from the base model and deploys it as a separate, independently scaled service on dedicated GPUs, enabling ControlNet caching, parallelization, and sharing. To mitigate the high loading overhead of LoRA serving, SwiftDiffusion employs a bounded asynchronous LoRA loading (BAL) technique, allowing LoRA loading to overlap with the initial base model execution by up to k steps without compromising image quality. Furthermore, SwiftDiffusion optimizes base model execution with a novel latent parallelism technique. Collectively, these designs enable SwiftDiffusion to outperform state-of-the-art T2I serving systems, achieving up to 7.8x latency reduction and 1.6x throughput improvement when serving SDXL models on H800 GPUs, without sacrificing image quality.
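The bounded asynchronous LoRA loading (BAL) idea described in the abstract can be sketched in a few lines: start fetching the LoRA weights in the background, let the base model run the first denoising steps without them, but never exceed the bound of k LoRA-free steps before blocking on the load. The sketch below is a minimal illustration under assumed interfaces (`load_lora`, `base_step`, and `lora_step` are hypothetical stand-ins for weight loading and denoising steps), not the paper's actual implementation.

```python
import threading


def serve_with_bal(num_steps, k, load_lora, base_step, lora_step):
    """Sketch of bounded asynchronous LoRA loading (BAL).

    LoRA weight loading overlaps with the first base-model denoising
    steps; after at most k steps without the LoRA applied, the loop
    blocks until loading completes, so image quality is preserved.
    """
    loaded = threading.Event()
    lora = {}

    def loader():
        lora.update(load_lora())  # fetch LoRA weights off the critical path
        loaded.set()

    threading.Thread(target=loader, daemon=True).start()

    latent = 0  # stand-in for the diffusion latent state
    steps_without_lora = 0
    for _ in range(num_steps):
        if not loaded.is_set():
            if steps_without_lora < k:
                latent = base_step(latent)  # overlap: base model only
                steps_without_lora += 1
                continue
            loaded.wait()  # bound reached: block until LoRA is ready
        latent = lora_step(latent, lora)  # LoRA-augmented step
    return latent, steps_without_lora
```

With k = 0 this degenerates to fully synchronous loading; a larger k hides more of the loading latency behind early denoising steps.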
ISSN:2331-8422
Source: Engineering Database