Structural Insights for LLM Serving Efficiency
| Published in: | ProQuest Dissertations and Theses (2025) |
|---|---|
| Main Author: | |
| Published: | ProQuest Dissertations & Theses |
| Subjects: | |
| Online Access: | Citation/Abstract, Full Text - PDF |
| Abstract: | The widespread adoption of Large Language Models (LLMs) has reshaped the datacenter computing landscape. As these models continue to grow in size and complexity, they require increasingly expensive and power-intensive infrastructure. Hence, serving LLMs efficiently has become critical for managing costs and resource constraints in modern datacenters. In this dissertation, I argue that serving efficiency can be significantly improved by designing systems that are aware of the distinct phases of generative LLM inference: a compute-intensive prefill phase and a memory-intensive decode phase. Exploiting the unique properties of these phases unlocks significant performance gains at scale. My research validates this thesis through three studies. First, I address power constraints, a key bottleneck to datacenter growth. By analyzing how the distinct power demands of the prefill and decode phases aggregate, I show that inference cluster power is underutilized. Based on this observation, I develop a power oversubscription framework that safely adds more servers under existing power budgets, increasing inference cluster capacity with minimal performance impact. Second, I show that running the compute-bound prefill and memory-bound decode phases on the same hardware leads to poor performance and resource stranding. To address these overheads, I introduce a new inference cluster architecture that disaggregates the phases onto hardware fleets specialized to better manage resources for each phase. This phase-separated cluster design yields substantial efficiency improvements over traditional approaches. Third, I extensively analyze the unique inefficiencies caused by conditional computation in Mixture-of-Experts (MoE) models, which I formalize as the MoE tax. This tax manifests differently across the two phases, for instance, creating load imbalance in prefill and increasing memory transfers in decode. Based on this analysis, I propose phase-specific optimizations to address these bottlenecks and improve the efficiency of serving MoE models at scale. Collectively, these studies demonstrate that phase awareness is a key principle for designing efficient generative LLM serving systems. |
| ISBN: | 9798293847723 |
| Source: | ProQuest Dissertations & Theses Global |
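
The abstract's central claim, that prefill is compute-bound while decode is memory-bound, can be illustrated with a back-of-the-envelope roofline estimate. The sketch below is not from the dissertation; the accelerator specs, model size, and batch sizes are illustrative assumptions chosen only to show why the two phases land on opposite sides of an accelerator's machine balance point.

```python
# Hypothetical roofline-style sketch (illustrative assumptions, not the
# dissertation's methodology). It estimates whether the prefill and decode
# phases of a dense transformer are compute-bound or memory-bound on a given
# accelerator, counting only weight traffic and ignoring attention/KV-cache
# reads for simplicity.

def arithmetic_intensity(tokens_per_pass: int, n_params: float,
                         bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Each token costs roughly 2 * n_params FLOPs (one multiply-accumulate per
    parameter), while the weights (n_params * bytes_per_param bytes) are
    streamed from memory once per pass, so intensity grows with the number of
    tokens processed together.
    """
    flops = 2.0 * n_params * tokens_per_pass
    bytes_moved = n_params * bytes_per_param
    return flops / bytes_moved


def bound_for(intensity: float, peak_tflops: float, mem_bw_tbps: float) -> str:
    """Compare intensity against the accelerator's machine balance (FLOPs/byte)."""
    machine_balance = (peak_tflops * 1e12) / (mem_bw_tbps * 1e12)
    return "compute-bound" if intensity > machine_balance else "memory-bound"


# Illustrative accelerator: ~1000 TFLOP/s peak compute, ~3 TB/s HBM bandwidth.
PEAK_TFLOPS, MEM_BW_TBPS = 1000.0, 3.0
N_PARAMS = 70e9  # hypothetical 70B-parameter dense model, fp16 weights

# Prefill: a 2048-token prompt is processed in one batched pass over the weights.
prefill_ai = arithmetic_intensity(tokens_per_pass=2048, n_params=N_PARAMS)
# Decode: each step emits one token per sequence; with batch size 8, only
# 8 tokens share each full sweep over the weights.
decode_ai = arithmetic_intensity(tokens_per_pass=8, n_params=N_PARAMS)

print("prefill:", bound_for(prefill_ai, PEAK_TFLOPS, MEM_BW_TBPS))  # compute-bound
print("decode: ", bound_for(decode_ai, PEAK_TFLOPS, MEM_BW_TBPS))   # memory-bound
```

Under these assumptions, a long prompt amortizes each sweep over the weights across thousands of tokens, while a decode step amortizes it over only a handful; that asymmetry is the property the phase-aware designs summarized in the abstract set out to exploit.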