Structural Insights for LLM Serving Efficiency
Saved in:
| Published in: | ProQuest Dissertations and Theses (2025) |
|---|---|
| Main author: | Patel, Pratyush |
| Published: | ProQuest Dissertations & Theses |
| Subjects: | Computer science; Computer engineering; Information technology |
| Online access: | Citation/Abstract; Full Text - PDF |
MARC
| Field | Ind1 | Ind2 | Content |
|---|---|---|---|
| LEADER | | | 00000nab a2200000uu 4500 |
| 001 | | | 3251644012 |
| 003 | | | UK-CbPIL |
| 020 | | | \|a 9798293847723 |
| 035 | | | \|a 3251644012 |
| 045 | 2 | | \|b d20250101 \|b d20251231 |
| 084 | | | \|a 66569 \|2 nlm |
| 100 | 1 | | \|a Patel, Pratyush |
| 245 | 1 | | \|a Structural Insights for LLM Serving Efficiency |
| 260 | | | \|b ProQuest Dissertations & Theses \|c 2025 |
| 513 | | | \|a Dissertation/Thesis |
| 520 | 3 | | \|a The widespread adoption of Large Language Models (LLMs) has reshaped the datacenter computing landscape. As these models continue to grow in size and complexity, they require increasingly expensive and power-intensive infrastructure. Hence, serving LLMs efficiently has become critical for managing costs and resource constraints in modern datacenters. In this dissertation, I argue that serving efficiency can be significantly improved by designing systems that are aware of the distinct phases of generative LLM inference: a compute-intensive prefill phase and a memory-intensive decode phase. Exploiting the unique properties of these phases unlocks significant performance gains at scale. My research validates this thesis through three studies. First, I address power constraints, a key bottleneck to datacenter growth. By analyzing how the distinct power demands of prefill and decode phases aggregate, I show that inference cluster power is underutilized. Based on this observation, I develop a power oversubscription framework that safely adds more servers under existing power budgets, increasing inference cluster capacity with minimal performance impact. Second, I show that running the compute-bound prefill and memory-bound decode phases on the same hardware leads to poor performance and resource stranding. To address these overheads, I introduce a new inference cluster architecture that disaggregates the phases onto hardware fleets specialized to better manage resources for each phase. This phase-separated cluster design yields substantial efficiency improvements over traditional approaches. Third, I extensively analyze the unique inefficiencies caused by conditional computation in Mixture-of-Experts (MoE) models, which I formalize as the MoE tax. This tax manifests differently across the two phases, for instance, creating load imbalance in prefill and increased memory transfers in decode. Based on this analysis, I propose phase-specific optimizations to address these bottlenecks and improve the efficiency of serving MoE models at scale. Collectively, these studies demonstrate that phase awareness is a key principle for designing efficient generative LLM serving systems. |
| 653 | | | \|a Computer science |
| 653 | | | \|a Computer engineering |
| 653 | | | \|a Information technology |
| 773 | 0 | | \|t ProQuest Dissertations and Theses \|g (2025) |
| 786 | 0 | | \|d ProQuest \|t ProQuest Dissertations & Theses Global |
| 856 | 4 | 1 | \|3 Citation/Abstract \|u https://www.proquest.com/docview/3251644012/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch |
| 856 | 4 | 0 | \|3 Full Text - PDF \|u https://www.proquest.com/docview/3251644012/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch |
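
A minimal sketch of the phase-disaggregated serving idea summarized in the abstract (field 520). All names here (`Pool`, `Request`, `route`) are hypothetical illustrations, not code or APIs from the dissertation: prefill work is queued to a compute-optimized fleet and decode work to a memory-optimized fleet.

```python
# Hypothetical illustration only: Pool, Request, and route() are assumptions
# made for this sketch, not artifacts from the dissertation.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    request_id: int
    prompt_tokens: int    # prefill cost scales with prompt length (compute-bound)
    max_new_tokens: int   # decode cost scales with generated length (memory-bound)


@dataclass
class Pool:
    """A server fleet specialized for one inference phase."""
    name: str
    servers: int
    queue: List[Request] = field(default_factory=list)


def route(req: Request, prefill_pool: Pool, decode_pool: Pool) -> None:
    """Phase-disaggregated routing: the prompt is processed on the prefill
    fleet, then the request moves to the decode fleet for token generation."""
    prefill_pool.queue.append(req)   # phase 1: compute-intensive prompt processing
    decode_pool.queue.append(req)    # phase 2: memory-intensive token generation


if __name__ == "__main__":
    prefill = Pool(name="prefill-fleet", servers=8)   # compute-optimized hardware
    decode = Pool(name="decode-fleet", servers=16)    # memory-optimized hardware
    route(Request(request_id=1, prompt_tokens=2048, max_new_tokens=256), prefill, decode)
    print(len(prefill.queue), len(decode.queue))      # -> 1 1
```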