Fast and Memory-Efficient Dynamic Programming Approach for Large-Scale EHH-Based Selection Scans

Bibliographic Details
Published in:	Molecular Biology and Evolution vol. 42, no. 11 (Nov 2025)
Main Author:	Rahman, Amatur
Other Authors:	Smith, T Quinn; Szpiech, Zachary A
Publisher:	Oxford University Press
Subjects:	Statistics; Dynamic programming; Computation; Source code; Haplotypes; Homozygosity; Machine learning; Algorithms; Positive selection; Genotypes; Genomics; Software; Run time (computers); Environmental
Online Access:	Citation/Abstract; Full Text - PDF

MARC
LEADER	00000nab a2200000uu 4500
001	3276084471
003	UK-CbPIL
022			\|a 0737-4038
022			\|a 1537-1719
024	7		\|a 10.1093/molbev/msaf275 \|2 doi
035			\|a 3276084471
045	2		\|b d20251101 \|b d20251130
084			\|a 78901 \|2 nlm
100	1		\|a Rahman, Amatur \|u Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA
245	1		\|a Fast and Memory-Efficient Dynamic Programming Approach for Large-Scale EHH-Based Selection Scans
260			\|b Oxford University Press \|c Nov 2025
513			\|a Journal Article
520	3		\|a Haplotype-based statistics are widely used for finding genomic regions under positive selection. At the heart of many such statistics is the computation of extended haplotype homozygosity (EHH), which captures the decay of homozygosity away from a focal site. This computation, repeated for potentially millions of sites, is computationally demanding, as it involves tracking counts of unique haplotypes iteratively over long genomic distances and across many individuals. Because of these computational challenges, existing tools do not scale well when applied to large-scale population datasets, such as the 1000 Genomes Project or the UK Biobank with 500,000 individuals. Optimizing computation becomes crucial when datasets grow large, especially when handling large sample sizes or generating training data for machine learning algorithms. Here, we propose a dynamic programming algorithm that substantially improves runtime and memory usage over existing tools on both real and simulated data. On real phased data, we achieve 5–50x speedup with minimal memory footprint. Our simulations show an even more pronounced performance gap with large populations (up to 15x speedup and 46x memory reduction). EHH-based statistics designed for unphased genotypes run an order of magnitude faster, and multi-parameter support results in 20x runtime improvement. Source code and binaries are available at https://github.com/szpiech/selscan as selscan v2.1.
653			\|a Statistics
653			\|a Dynamic programming
653			\|a Computation
653			\|a Source code
653			\|a Haplotypes
653			\|a Homozygosity
653			\|a Machine learning
653			\|a Algorithms
653			\|a Positive selection
653			\|a Genotypes
653			\|a Genomics
653			\|a Software
653			\|a Run time (computers)
653			\|a Environmental
700	1		\|a Smith, T Quinn \|u Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA
700	1		\|a Szpiech, Zachary A \|u Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA
773	0		\|t Molecular Biology and Evolution \|g vol. 42, no. 11 (Nov 2025)
786	0		\|d ProQuest \|t Health & Medical Collection
856	4	1	\|3 Citation/Abstract \|u https://www.proquest.com/docview/3276084471/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch
856	4	0	\|3 Full Text - PDF \|u https://www.proquest.com/docview/3276084471/fulltextPDF/embedded/6A8EOT78XXH2IG52?source=fedsrch