Theoretical Bounds and Systems Development for Exact/Approximate Multiplicities in Bag PDBs

Guardado en:

Detalles Bibliográficos
Publicado en:	ProQuest Dissertations and Theses (2025)
Autor principal:	Huber, Aaron
Publicado:	ProQuest Dissertations & Theses
Materias:	Computer science Computer engineering
Acceso en línea:	Citation/Abstract Full Text - PDF
Etiquetas:	Agregar Etiqueta Sin Etiquetas, Sea el primero en etiquetar este registro!

Descripción
Resumen:	Probabilistic databases (PDBs) provide users with a principled way to query data that is incomplete or imprecise. This work studies computing expected multiplicities of query results over probabilistic databases under bag semantics which has PTIME data complexity. However, does this imply that bag probabilistic databases are practical? We strive to answer this question from both a theoretical as well as a systems perspective.The problem of computing the marginal probability of a tuple in the result of a query over set probabilistic databases (PDBs) can be reduced to calculating the probability of the lineage formula of the result, a Boolean formula over random variables representing the existence of tuples in the database’s possible worlds. The analog for bag semantics is a natural number-valued polynomial over random variables that evaluates to the multiplicity of the tuple in each world. This work studies the problem of calculating the expectation of such polynomials (a tuple’s expected multiplicity) exactly and approximately. For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple’s lineage, if this polynomial is encoded as a sum of products (the standard operating procedure for Set-PDBs). This work further examines the fine-grained complexity of this problem for c-TIDBs, i.e., probabilistic databases where tuples are independent probabilistic events and the multiplicity of each tuple is bound by a constant c. Unfortunately, our results imply that computing expected multiplicities for c-TIDBs introduces super-linear overhead over the corresponding deterministic query evaluation algorithms (under certain complexity hardness conjectures). Using a reduction from the problem of counting k-matchings, this work demonstrates that calculating the expectation is #W[1]-hard when the polynomial is compressed, for example through factorization. Such factorized representations are exploited by modern join algorithms (e.g., worst-case optimal joins), and so our results imply that computing probabilities for Bag-PDB based on the results produced by such algorithms introduces super-linear overhead. Next, this work then develops a sampling algorithm that computes a (1 ± ϵ)-approximation of the expected multiplicity of an output tuple in time linear in the runtime of the corresponding deterministic query for any positive relational algebra (RA+ ) query over c-TIDBs and for a non-trivial subclass of block-independent databases. A remaining issue, however, is that constructing such circuits, while in PTIME, can nonetheless have significant overhead. To avoid this cost, we utilize approximate query processing techniques to directly sample monomials without materializing lineage upfront. Our implementation in FASTPDB provides accurate anytime approximation of probabilistic query answers and scales to datasets orders of magnitude larger than competing methods.By removing Bag-PDB’s reliance on the sum-of-products representation of polynomials, this result paves the way for future work on PDBs that are competitive with deterministic databases.
ISBN:	9798293833276
Fuente:	ProQuest Dissertations & Theses Global