Benchmarking the coding strategies of non-coding mutations on sequence-based downstream tasks with machine learning

Bibliographic details
Published in: bioRxiv (Jan 2, 2025)
Main author: Liu, Zhe
Other authors: Bao, Yihang; Li, Wenhao; Li, Weihao; Guan Nin Lin
Published: Cold Spring Harbor Laboratory Press
Subjects:
Online access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3150948807
003 UK-CbPIL
022 |a 2692-8205 
024 7 |a 10.1101/2025.01.01.631025  |2 doi 
035 |a 3150948807 
045 0 |b d20250102 
100 1 |a Liu, Zhe 
245 1 |a Benchmarking the coding strategies of non-coding mutations on sequence-based downstream tasks with machine learning 
260 |b Cold Spring Harbor Laboratory Press  |c Jan 2, 2025 
513 |a Working Paper 
520 3 |a Non-coding single nucleotide polymorphisms (SNPs) are critical drivers of gene regulation and disease susceptibility, yet predicting their functional impact remains a challenging task. A variety of methods exist for encoding non-coding SNPs, such as direct base encoding or using pre-trained models to obtain embeddings. However, there is a lack of comprehensive evaluation and guidance on the choice of encoding strategies for downstream prediction tasks involving non-coding SNPs. To address this gap, we present a benchmark study that compares six distinct encoding strategies for non-coding SNPs, assessing them across six dimensions, including interpretability, encoding abundance, and computational efficiency. Using three Quantitative Trait Loci (QTL)-related downstream tasks involving non-coding SNPs, we test these encoding strategies in combination with nine machine learning and deep learning models. Our findings demonstrate that semantic embeddings show strong robustness, while the choice of coding strategy and the model used for downstream prediction are all key variables influencing task performance. This benchmark provides actionable insights into the interplay between encoding strategies, models, and data properties, offering a framework for optimizing QTL prediction tasks and advancing the analysis of non-coding SNPs in genomic regulation. Competing Interest Statement: The authors have declared no competing interest. 
653 |a Machine learning 
653 |a Genomic analysis 
653 |a Single-nucleotide polymorphism 
653 |a Quantitative trait loci 
653 |a Models 
653 |a Predictions 
653 |a Gene regulation 
653 |a Deep learning 
653 |a Choice learning 
653 |a Learning algorithms 
700 1 |a Bao, Yihang 
700 1 |a Li, Wenhao 
700 1 |a Li, Weihao 
700 1 |a Guan Nin Lin 
773 0 |t bioRxiv  |g (Jan 2, 2025) 
786 0 |d ProQuest  |t Biological Science Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3150948807/abstract/embedded/H09TXR3UUZB2ISDL?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u https://www.biorxiv.org/content/10.1101/2025.01.01.631025v1