SimdMinimizers: Computing random minimizers, fast

Bibliographic details
Published in: bioRxiv (Jan 27, 2025)
Main author: Ragnar Groot Koerkamp
Other authors: Martayan, Igor
Published:
Cold Spring Harbor Laboratory Press
Subjects:
Algorithms
Nucleotide sequence
Online access: Citation/Abstract
Full Text - PDF
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3160207776
003 UK-CbPIL
022 |a 2692-8205 
024 7 |a 10.1101/2025.01.27.634998  |2 doi 
035 |a 3160207776 
045 0 |b d20250127 
100 1 |a Ragnar Groot Koerkamp 
245 1 |a SimdMinimizers: Computing random minimizers, fast 
260 |b Cold Spring Harbor Laboratory Press  |c Jan 27, 2025 
513 |a Working Paper 
520 3 |a Because of the rapidly-growing amount of sequencing data, computing sketches of large textual datasets has become an essential preprocessing task. These sketches are typically much smaller than the input sequences, but preserve sufficient information for downstream analysis. Minimizers are an especially popular sketching technique and used in a wide variety of applications. They sample at least one out of every w consecutive k-mers. As DNA sequencers are getting more accurate, some applications can afford to use a larger w and hence sparser and smaller sketches. And as sketches get smaller, their analysis becomes faster, so the time spent sketching the full-sized input becomes more of a bottleneck. Our library simd-minimizers implements a random minimizer algorithm using SIMD instructions. It supports both AVX2 and NEON architectures. Its main novelty is two-fold. First, it splits the input into 8 chunks that are streamed over in parallel through all steps of the algorithm. This is enabled by using the completely deterministic two-stacks sliding window minimum algorithm, which seems not to have been used before for finding minimizers. Our library is up to 9.5x faster than a scalar implementation of the rescan method when w=5 is small, and 4.5x faster for larger w=19. Computing canonical minimizers is only around 50% slower than computing forward minimizers, and around 16x faster than the existing implementation in the minimizer-iter crate. Our library finds all (canonical) minimizers of a 3.2Gbp human genome in 4.1 (resp. 6.0) seconds. Availability: simd-minimizers is available at https://github.com/rust-seq/simd-minimizers Competing Interest Statement: The authors have declared no competing interest. 
653 |a Algorithms 
653 |a Nucleotide sequence 
700 1 |a Martayan, Igor 
773 0 |t bioRxiv  |g (Jan 27, 2025) 
786 0 |d ProQuest  |t Biological Science Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3160207776/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3160207776/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u https://www.biorxiv.org/content/10.1101/2025.01.27.634998v1
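
Illustrative sketch

The abstract in field 520 describes random minimizers (the k-mer with the smallest hash in every window of w consecutive k-mers) and a two-stacks sliding window minimum. The Rust sketch below is one possible scalar illustration of both ideas, checked against a naive window scan. It is not the simd-minimizers implementation or API: it uses no SIMD and no 8-way chunking, the hash `mix` is a hypothetical stand-in, and all function names are invented for this example. It assumes ACGT-only input and 1 <= k <= 32.

```rust
/// Hypothetical stand-in hash (an assumption, not the library's hash).
fn mix(x: u64) -> u64 {
    let x = x.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    x ^ (x >> 29)
}

/// Hashes of all k-mers of `seq`, using a rolling 2-bit encoding.
fn kmer_hashes(seq: &[u8], k: usize) -> Vec<u64> {
    assert!((1..=32).contains(&k));
    let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    let mut kmer = 0u64;
    let mut out = Vec::new();
    for (i, &c) in seq.iter().enumerate() {
        let code: u64 = match c {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => panic!("only ACGT is supported in this sketch"),
        };
        kmer = ((kmer << 2) | code) & mask;
        if i + 1 >= k {
            out.push(mix(kmer));
        }
    }
    out
}

/// Reference version: rescan every window of `w` hashes.
/// Ties are broken towards the leftmost position.
fn minimizers_naive(hashes: &[u64], w: usize) -> Vec<usize> {
    assert!(w >= 1);
    let mut out: Vec<usize> = Vec::new();
    for start in 0..(hashes.len() + 1).saturating_sub(w) {
        let best = (start..start + w).min_by_key(|&p| (hashes[p], p)).unwrap();
        if out.last() != Some(&best) {
            out.push(best);
        }
    }
    out
}

/// Two-stacks sliding window minimum over (hash, position) pairs.
/// `ring` holds the last block of `w` elements; every `w` steps it is
/// rewritten in place as suffix minima (the "front" stack), while
/// `prefix_min` is the running minimum of elements pushed since then
/// (the "back" stack). The work per element does not depend on the data.
fn minimizers_two_stacks(hashes: &[u64], w: usize) -> Vec<usize> {
    assert!(w >= 1);
    const INF: (u64, usize) = (u64::MAX, usize::MAX);
    let mut ring = vec![INF; w];
    let mut prefix_min = INF;
    let mut i = 0;
    let mut out: Vec<usize> = Vec::new();
    for (pos, &h) in hashes.iter().enumerate() {
        let elem = (h, pos);
        ring[i] = elem;
        prefix_min = prefix_min.min(elem);
        i += 1;
        if i == w {
            // Block is full: turn it into suffix minima and reset the prefix.
            i = 0;
            for j in (0..w - 1).rev() {
                ring[j] = ring[j].min(ring[j + 1]);
            }
            prefix_min = INF;
        }
        if pos + 1 >= w {
            // Window = not-yet-overwritten suffix of the old block + new prefix.
            let best = prefix_min.min(ring[i]).1;
            if out.last() != Some(&best) {
                out.push(best);
            }
        }
    }
    out
}
```

Calling `minimizers_two_stacks(&kmer_hashes(seq, k), w)` returns the deduplicated start positions of the selected k-mers. Because every window of w consecutive k-mers contributes one position, consecutive positions in the result are at most w apart.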