Theoretical Bounds and Systems Development for Exact/Approximate Multiplicities in Bag PDBs

Bibliographic details
Published in: ProQuest Dissertations and Theses (2025)
Main author: Huber, Aaron
Published:
ProQuest Dissertations & Theses
Subjects: Computer science; Computer engineering
Links: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3250297233
003 UK-CbPIL
020 |a 9798293833276 
035 |a 3250297233 
045 2 |b d20250101  |b d20251231 
084 |a 66569  |2 nlm 
100 1 |a Huber, Aaron 
245 1 |a Theoretical Bounds and Systems Development for Exact/Approximate Multiplicities in Bag PDBs 
260 |b ProQuest Dissertations & Theses  |c 2025 
513 |a Dissertation/Thesis 
520 3 |a Probabilistic databases (PDBs) provide users with a principled way to query data that is incomplete or imprecise. This work studies computing expected multiplicities of query results over probabilistic databases under bag semantics, which has PTIME data complexity. However, does this imply that bag probabilistic databases are practical? We strive to answer this question from both a theoretical and a systems perspective. The problem of computing the marginal probability of a tuple in the result of a query over set probabilistic databases (PDBs) can be reduced to calculating the probability of the lineage formula of the result, a Boolean formula over random variables representing the existence of tuples in the database’s possible worlds. The analog for bag semantics is a natural number-valued polynomial over random variables that evaluates to the multiplicity of the tuple in each world. This work studies the problem of calculating the expectation of such polynomials (a tuple’s expected multiplicity) exactly and approximately. For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in time linear in the size of the tuple’s lineage, if this polynomial is encoded as a sum of products (the standard operating procedure for Set-PDBs). This work further examines the fine-grained complexity of this problem for c-TIDBs, i.e., probabilistic databases where tuples are independent probabilistic events and the multiplicity of each tuple is bounded by a constant c. Unfortunately, our results imply that computing expected multiplicities for c-TIDBs introduces super-linear overhead over the corresponding deterministic query evaluation algorithms (under certain complexity hardness conjectures). Using a reduction from the problem of counting k-matchings, this work demonstrates that calculating the expectation is #W[1]-hard when the polynomial is compressed, for example through factorization. 
Such factorized representations are exploited by modern join algorithms (e.g., worst-case optimal joins), and so our results imply that computing probabilities for Bag-PDBs based on the results produced by such algorithms introduces super-linear overhead. Next, this work develops a sampling algorithm that computes a (1 ± ϵ)-approximation of the expected multiplicity of an output tuple in time linear in the runtime of the corresponding deterministic query for any positive relational algebra (RA+) query over c-TIDBs and for a non-trivial subclass of block-independent databases. A remaining issue, however, is that constructing such circuits, while in PTIME, can nonetheless have significant overhead. To avoid this cost, we utilize approximate query processing techniques to directly sample monomials without materializing lineage upfront. Our implementation in FASTPDB provides accurate anytime approximation of probabilistic query answers and scales to datasets orders of magnitude larger than competing methods. By removing the reliance of Bag-PDBs on the sum-of-products representation of polynomials, this result paves the way for future work on PDBs that are competitive with deterministic databases. 
653 |a Computer science 
653 |a Computer engineering 
773 0 |t ProQuest Dissertations and Theses  |g (2025) 
786 0 |d ProQuest  |t ProQuest Dissertations & Theses Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3250297233/abstract/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3250297233/fulltextPDF/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch
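
Note on the abstract (field 520): the statement that a tuple's expected multiplicity can be computed in linear time from a sum-of-products lineage follows from linearity of expectation and tuple independence. The sketch below is illustrative only and is not taken from the dissertation. It assumes the simplest case, where each tuple variable is 0 or 1 (that is, c = 1); the function name, the input encoding, and the example lineage are all hypothetical.

```python
from math import prod

def expected_multiplicity(monomials, prob):
    """Expectation of a sum-of-products lineage polynomial over a TIDB.

    monomials: list of (coefficient, {variable: exponent}) pairs (assumed encoding)
    prob:      {variable: probability that the tuple exists}
    Assumes every variable is an independent 0/1 random variable.
    """
    total = 0.0
    for coeff, powers in monomials:
        # For a 0/1 variable X, X**k equals X for k >= 1, so E[X**k] = P(X = 1).
        # Independence lets the expectation of a product factor into
        # the product of the per-variable expectations.
        total += coeff * prod(prob[v] for v, k in powers.items() if k >= 1)
    return total

# Hypothetical lineage polynomial: 2*x*y + x^2*z
lineage = [(2, {"x": 1, "y": 1}), (1, {"x": 2, "z": 1})]
probs = {"x": 0.5, "y": 0.5, "z": 0.25}
result = expected_multiplicity(lineage, probs)
print(result)
```

Each monomial is visited once, so the running time is linear in the size of the expanded polynomial. This shortcut depends on the sum-of-products form: in a factorized polynomial such as (x + y)(x + z), the factors share the variable x and are not independent, so their expectations cannot simply be multiplied.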