Data Fusion for Integrative Species Identification Using Deep Learning

Bibliographic Details
Published in: bioRxiv (Jan 24, 2025)
Main Author: Koesters, Lara M
Other Authors: Karbstein, Kevin; Hofmann, Martin; Hodac, Ladislav; Maeder, Patrick; Waeldchen, Jana
Published: Cold Spring Harbor Laboratory Press
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3159549565
003 UK-CbPIL
022 |a 2692-8205 
024 7 |a 10.1101/2025.01.22.634270  |2 doi 
035 |a 3159549565 
045 0 |b d20250124 
100 1 |a Koesters, Lara M 
245 1 |a Data Fusion for Integrative Species Identification Using Deep Learning 
260 |b Cold Spring Harbor Laboratory Press  |c Jan 24, 2025 
513 |a Working Paper 
520 3 |a DNA analyses have revolutionized species identification and taxonomic work. Yet, persistent challenges arise from little differentiation among species and considerable variation within species, particularly among closely related groups. While images are commonly used as an alternative modality for automated identification tasks, their usability is limited by the same concerns. An integrative strategy, fusing molecular and image data through machine learning, holds significant promise for fine-grained species identification. However, a systematic overview and rigorous statistical testing concerning molecular and image preprocessing and fusion techniques, including practical advice for biologists, are missing so far. We introduce a machine learning scheme that integrates both molecular and image data for species identification. Initially, we systematically assess and compare three different DNA arrangements (aligned, unaligned, SNP-reduced) and two encoding methods (fractional, ordinal). Later, artificial neural networks are used to extract visual and molecular features, and we propose strategies for fusing this information. Specifically, we investigate three strategies: I) fusing directly after feature extraction, II) fusing features that passed through a fully connected layer after feature extraction, and III) fusing the output scores of both unimodal models. We systematically and statistically evaluate these strategies for four eukaryotic datasets, including two plant (Asteraceae, Poaceae) and two animal families (Lycaenidae, Coccinellidae) using Leave-One-Out Cross-Validation (LOOCV). In addition, we developed an approach to understand molecular- and image-specific identification failure. Aligned sequences with nucleotides encoded as decimal number vectors achieved the highest identification accuracy among DNA data preprocessing techniques in all four datasets. 
Fusing molecular and visual features directly after feature extraction yielded the best results for three out of four datasets (52-99%). Overall, combining DNA with image data significantly increased accuracy in three out of four datasets, with plant datasets showing the most substantial improvement (Asteraceae: +19%, Poaceae: +13.6%). Even for Lycaenidae with high identification accuracy based on molecular data (>96%), a statistically significant improvement (+2.1%) was observed. Detailed analysis of confusion rates between and within genera shows that DNA alone tends to identify the genus correctly, but often fails to recognize the species. The failure to resolve species is alleviated by including image data in the training. This increase in resolution hints at a hierarchical role of modalities in which molecular data coarsely groups the specimens to then be guided towards a more fine-grained identification by the connected image. We systematically showed and explained, for the first time, that optimizing the preprocessing and integration of molecular and image data offers significant benefits, particularly for genetically similar and morphologically indistinguishable species, enhancing species identification by reducing modality-specific failure rates and information gaps. Our results can inform integration efforts for various organism groups, improving automated identification across a wide range of eukaryotic species. Competing Interest Statement: The authors have declared no competing interest. 
653 |a Machine learning 
653 |a Accuracy 
653 |a Genera 
653 |a Datasets 
653 |a Identification 
653 |a Deoxyribonucleic acid--DNA 
653 |a Introduced species 
653 |a Plant extracts 
653 |a Image processing 
653 |a Single-nucleotide polymorphism 
653 |a Information processing 
653 |a Automation 
653 |a Statistical analysis 
653 |a Nucleotide sequence 
653 |a Deep learning 
653 |a Neural networks 
653 |a Learning algorithms 
653 |a Poaceae 
653 |a Asteraceae 
653 |a Lycaenidae 
700 1 |a Karbstein, Kevin 
700 1 |a Hofmann, Martin 
700 1 |a Hodac, Ladislav 
700 1 |a Maeder, Patrick 
700 1 |a Waeldchen, Jana 
773 0 |t bioRxiv  |g (Jan 24, 2025) 
786 0 |d ProQuest  |t Biological Science Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3159549565/abstract/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u https://www.biorxiv.org/content/10.1101/2025.01.22.634270v1
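
Illustrative note on the abstract (520 field): the three fusion strategies it lists can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The feature sizes, the random weights, the ReLU activation, and the use of a plain average for score fusion are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def random_layer(rng, n_in, n_out):
    """Hypothetical untrained layer; real models would learn these weights."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def fuse_features(dna_feat, img_feat, head):
    """Strategy I: concatenate the extracted features, then classify."""
    return softmax(dense(dna_feat + img_feat, *head))

def fuse_after_fc(dna_feat, img_feat, dna_fc, img_fc, head):
    """Strategy II: pass each modality through its own FC layer, then concatenate."""
    h_dna = relu(dense(dna_feat, *dna_fc))
    h_img = relu(dense(img_feat, *img_fc))
    return softmax(dense(h_dna + h_img, *head))

def fuse_scores(dna_scores, img_scores):
    """Strategy III: combine unimodal output scores (plain average is an assumption)."""
    return [(d + i) / 2 for d, i in zip(dna_scores, img_scores)]

# Toy dimensions and stand-in feature vectors (all hypothetical).
rng = random.Random(0)
N_CLASSES, DNA_DIM, IMG_DIM, HIDDEN = 3, 4, 5, 2
dna_feat = [rng.random() for _ in range(DNA_DIM)]
img_feat = [rng.random() for _ in range(IMG_DIM)]

p1 = fuse_features(dna_feat, img_feat,
                   random_layer(rng, DNA_DIM + IMG_DIM, N_CLASSES))
p2 = fuse_after_fc(dna_feat, img_feat,
                   random_layer(rng, DNA_DIM, HIDDEN),
                   random_layer(rng, IMG_DIM, HIDDEN),
                   random_layer(rng, 2 * HIDDEN, N_CLASSES))
dna_scores = softmax(dense(dna_feat, *random_layer(rng, DNA_DIM, N_CLASSES)))
img_scores = softmax(dense(img_feat, *random_layer(rng, IMG_DIM, N_CLASSES)))
p3 = fuse_scores(dna_scores, img_scores)
```

Each of `p1`, `p2`, and `p3` is a probability vector over the toy species classes. The strategies differ only in where the two modalities meet: before any shared layer, after a per-modality layer, or at the output scores.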