Statistical Inference for Spatial Transcriptomics in the Age of Deep Learning

Bibliographic details
Published in:	ProQuest Dissertations and Theses (2025)
Main author:	Kouznetsov, Roman
Published:	ProQuest Dissertations & Theses
Subjects:	Cellular biology; Statistics; Computer science; Bioinformatics
Online access:	Citation/Abstract; Full Text - PDF

MARC


LEADER	00000nab a2200000uu 4500
001	3245488818
003	UK-CbPIL
020			|a 9798291566138
035			|a 3245488818
045	2		|b d20250101 |b d20251231
084			|a 66569 |2 nlm
100	1		|a Kouznetsov, Roman
245	1		|a Statistical Inference for Spatial Transcriptomics in the Age of Deep Learning
260			|b ProQuest Dissertations & Theses |c 2025
513			|a Dissertation/Thesis
520	3		|a Single-cell spatial transcriptomics enables the measurement of gene expression of individual cells while simultaneously capturing the spatial positions of these cells within a tissue sample. To utilize these spatial positions effectively, careful model selection is required to ensure conclusions reflect spatial dependencies in the underlying biology. In this dissertation, we contribute three novel methodologies that merge deep learning with statistical inference for spatial transcriptomics data. First, we attempt to better predict gene expression by leveraging the spatial context included in spatial transcriptomics data. Comparing predictions from a spatial model to those from a baseline regressor without cell neighborhood information offers insights into how expression changes as a result of cell-cell communication (CCC) signals. However, to trust conclusions reached from such a paired modeling framework, we need to ensure that the baseline version of a model provides a valid non-spatial reference point. To this end, we develop a graph convolutional network (GCN) that uses graphs defined by cellular positions to predict gene expression. By encoding tissue samples as a graph, in which nodes represent cells and edges indicate spatial proximity between cells, we can leverage the full spatial layout and gene expression profile of the tissue. We find a marked performance gap between spatially aware and spatially ignorant models, highlighting the GCN’s ability to model spatial effects in both real and semi-synthetic settings. These results underscore the importance of model structure in spatial inference, because a spatially ignorant version of the GCN can make better predictions than spatially aware versions of previous methods.
Second, we study a clustering task for spatial transcriptomics data through a Bayesian framework. A central challenge in spatial transcriptomics is to identify distinct cell communities that not only reflect transcriptional heterogeneity but also preserve spatial coherence across tissue. These clusters often represent biological components such as cortical layers, tissue micro-environments, or pathological regions, whose spatial organization is critical for interpreting tissue structure and function. However, spatial transcriptomics data are collected at varying resolutions; as such, any spatial unit indexed by the data may contain multiple communities of varying memberships. Many exact Bayesian approaches model hard cluster assignments, which limits their adaptability to datasets of varying resolutions. To address this limitation, we introduce a stochastic variational inference (SVI) method designed to learn posterior spot cluster distributions that are both spatially coherent and biologically interpretable. Our approach enhances clustering accuracy by incorporating spatial relationships through carefully designed prior distributions, allowing it to balance the trade-off between smoothness and expression differences. Furthermore, the method is scalable and effective across data resolutions. As spot data scale polynomially with finer resolution, SVI becomes a more favorable approach. It is more computationally efficient than previous methods that rely on posterior sampling techniques, such as Markov chain Monte Carlo (MCMC), which can be prohibitively expensive to retrain. This method groups tissues into more contiguous regions compared to previous methods while preserving expression heterogeneity consistent with earlier studies, offering a competitive alternative to existing approaches.
Third, to expand the work of Bayesian clustering with SVI, we leverage normalizing flows as the approximate posterior distributions for variational inference. Normalizing flows transform simple base distributions (e.g., Gaussian) into more expressive ones by stacking L invertible transformations based on the change-of-variables formula. By using normalizing flows instead of standard choices like a mean-field or full-covariance Gaussian as the approximate posterior, we can model more flexible, multi-modal posteriors over soft cluster assignments in a way that simpler variational families cannot express. We demonstrate that the posteriors learned by these normalizing flows accurately recover cluster membership compositions, guided by prior distributions that encode spatial dependencies.
653			|a Cellular biology
653			|a Statistics
653			|a Computer science
653			|a Bioinformatics
773	0		|t ProQuest Dissertations and Theses |g (2025)
786	0		|d ProQuest |t ProQuest Dissertations & Theses Global
856	4	1	|3 Citation/Abstract |u https://www.proquest.com/docview/3245488818/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch
856	4	0	|3 Full Text - PDF |u https://www.proquest.com/docview/3245488818/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch