Distributed Memory Algorithms for High-Dimensional Data Embedding
| Published in: | ProQuest Dissertations and Theses (2025) |
|---|---|
| Main author: | |
| Published: | ProQuest Dissertations & Theses |
| Subjects: | |
| Online access: | Citation/Abstract Full Text - PDF |
| Abstract: | High-dimensional data embedding maps high-dimensional data into a lower-dimensional embedding space while preserving as much of its intrinsic structure as possible. It is a fundamental technique in machine learning tasks such as node classification and link prediction in networks, as well as in data visualization methods such as Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) and t-distributed Stochastic Neighbor Embedding (t-SNE). With technological advancements over time, datasets in real-world applications have grown exponentially in size and dimensionality. For instance, popular datasets such as Facebook SimSearchNet++ and BIGANN have 1 billion data points with over 100 dimensions. Given the sheer scale of these datasets, generating embeddings requires large-scale computing resources, such as distributed clusters and supercomputers. This necessitates research and development of distributed, scalable algorithms that can efficiently handle massive datasets without compromising accuracy. This dissertation develops new distributed algorithms and scalable software for embedding large-scale, high-dimensional data. We break the embedding problem into two subproblems: k-nearest neighbor graph (KNNG) construction and graph embedding. We identify the key sparse matrix operations in these subproblems and optimize them to minimize inter-process communication and keep local computation efficient. First, we propose a novel distributed approximation algorithm for KNNG construction using sparse random forest techniques. Second, we discuss embedding the KNNG with a distributed force-directed graph embedding algorithm, and we propose a novel communication minimization technique for a minibatch algorithm based on a configurable push-pull approach. Third, we combine the two phases into a large-scale data visualization pipeline that handles datasets with millions of data points, a scale that shared-memory UMAP does not support. Finally, we present a novel distributed algorithm for Tall-and-Skinny Sparse Matrix-Matrix Multiplication (TS-SpGEMM). The proposed techniques can be used to implement multi-source breadth-first search, influence maximization, and sparse embedding applications. Our algorithms achieve unprecedented scalability across thousands of cores on modern supercomputers by minimizing inter-process communication. Consequently, this dissertation advances large-scale data embedding and visualization across diverse scientific domains. |
| ISBN: | 9798314896259 |
| Source: | ProQuest Dissertations & Theses Global |
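
The abstract notes that tall-and-skinny sparse matrix-matrix multiplication can be used to implement multi-source breadth-first search. The sketch below illustrates that connection on a single process only; it is not the dissertation's distributed TS-SpGEMM algorithm. It assumes an undirected graph stored as a row-wise dictionary of neighbor sets, and the names `SparseBool`, `spgemm_bool`, and `multi_source_bfs` are hypothetical, chosen for this illustration. Each BFS step multiplies the square adjacency matrix by an n x d frontier matrix (one column per source, so it is tall and skinny when d is much smaller than n) over the Boolean (OR, AND) semiring.

```python
from typing import Dict, List, Set

# Row-wise sparse Boolean matrix: row index -> set of column indices that are nonzero.
SparseBool = Dict[int, Set[int]]


def spgemm_bool(a: SparseBool, b: SparseBool) -> SparseBool:
    """Row-by-row product C = A * B over the (OR, AND) semiring."""
    c: SparseBool = {}
    for i, a_cols in a.items():
        acc: Set[int] = set()
        for k in a_cols:              # A[i, k] is nonzero
            acc |= b.get(k, set())    # OR row k of B into row i of C
        if acc:
            c[i] = acc
    return c


def multi_source_bfs(adj: SparseBool, sources: List[int]) -> List[Dict[int, int]]:
    """Return levels, where levels[j][v] is the hop distance from sources[j] to v."""
    # Frontier matrix F: row v contains column j if v is on the frontier of source j.
    frontier: SparseBool = {}
    for j, s in enumerate(sources):
        frontier.setdefault(s, set()).add(j)
    visited: SparseBool = {v: set(js) for v, js in frontier.items()}
    levels: List[Dict[int, int]] = [{s: 0} for s in sources]
    depth = 0
    while frontier:
        depth += 1
        reached = spgemm_bool(adj, frontier)   # one sparse product per BFS level
        frontier = {}
        for v, js in reached.items():
            new = js - visited.get(v, set())   # mask out (vertex, source) pairs already seen
            if new:
                frontier[v] = new
                visited.setdefault(v, set()).update(new)
                for j in new:
                    levels[j][v] = depth
    return levels


# Usage: a path graph 0 - 1 - 2 - 3 searched from both ends at once.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
levels = multi_source_bfs(adj, [0, 3])
# levels[0] == {0: 0, 1: 1, 2: 2, 3: 3}
# levels[1] == {3: 0, 2: 1, 1: 2, 0: 3}
```

In a distributed setting the rows of both matrices would be partitioned across processes, and fetching the remote rows of the frontier matrix needed by each product is where inter-process communication arises; that part is omitted here.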