Moving towards more holistic machine learning-based approaches for classification problems in animal studies
| Published in: | bioRxiv (Jan 27, 2025) |
|---|---|
| Published: | Cold Spring Harbor Laboratory Press |
| Online access: | Citation/Abstract; Full Text - PDF; Full text outside of ProQuest |
| Abstract: | Machine learning (ML) is revolutionizing field and laboratory studies of animals. However, a challenge when deploying ML for classification tasks is ensuring that the models are reliable. Currently, we evaluate models using performance metrics (e.g., precision, recall, F1), but these can overlook the ultimate aim, which is not the outputs themselves (e.g., detected species, individual identities, or behaviour) but their incorporation into hypothesis testing. As improving performance metrics has diminishing returns, particularly when data are inherently noisy (as human-labelled, animal-based data often are), researchers face the conundrum of investing more time in maximising metrics versus doing the actual research. This raises the question: how much noise can we accept in ML models? Here, we start by describing an under-reported factor that can cause metrics to underestimate model performance. Specifically, ambiguity between categories or mistakes in labelling validation data produce hard ceilings that limit performance metrics. This likely widespread issue means that many models could be performing better than their metrics suggest. Next, we argue and show that imperfect models (e.g., those with low F1 scores) can still be usable. Using a case study on ML-identified behaviour from vulturine guineafowl accelerometer data, we first propose a simulation framework to evaluate the robustness of hypothesis testing with models that make classification errors. Second, we show how to determine the utility of a model by supplementing existing performance metrics with 'biological validations'. This involves applying ML models to unlabelled data and using the models' outputs to test hypotheses for which we can anticipate the outcome. Together, we show that effect sizes and expected biological patterns can be detected even when performance metrics are relatively low (e.g., F1: 60-70%). In doing so, we provide a roadmap of validation approaches for ML classification models, tailored to research in animal behaviour and other fields with noisy biological data. Competing Interest Statement: The authors have declared no competing interest. Footnotes: The revision defines the scope of the paper more clearly (using machine learning for the classification of raw data to be used in subsequent hypothesis testing) and adds methodological details (Ethical note; Figure S2, showing the alignment of accelerometer data with labels). |
| ISSN: | 2692-8205 |
| DOI: | 10.1101/2024.10.18.618969 |
| Source: | Biological Science Database |
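The abstract's point that errors in validation labels impose a hard ceiling on performance metrics can be illustrated numerically. The minimal sketch below is not from the paper: it assumes a binary classification task and a 10% annotation-error rate purely for illustration. Even a model that predicts every true class correctly cannot score above roughly 90% measured accuracy against the corrupted labels.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 10_000
true_labels = rng.integers(0, 2, size=n)  # ground-truth behaviour classes (hypothetical)

# A hypothetically perfect model: it always predicts the true class.
predictions = true_labels.copy()

# Corrupt a fraction of the *validation* labels to mimic annotator error
# or ambiguity between behaviour categories (assumed rate, illustration only).
label_error_rate = 0.10
flip = rng.random(n) < label_error_rate
observed_labels = np.where(flip, 1 - true_labels, true_labels)

# Measured accuracy is capped near 1 - label_error_rate, even though the
# model itself makes no mistakes.
measured_accuracy = (predictions == observed_labels).mean()
print(f"Measured accuracy: {measured_accuracy:.3f}")  # ~0.90, not 1.00
```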
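The simulation framework the abstract describes, evaluating whether hypothesis tests remain robust when the classifier makes errors, can be sketched in the same spirit. Everything here is an assumption for illustration: the group sizes, the true behavioural effect, the symmetric label-flipping error model, and the chi-squared test are stand-ins, not the authors' actual pipeline for the vulturine guineafowl data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def corrupt(labels, error_rate, rng):
    """Flip each binary behaviour label with probability `error_rate`,
    a crude stand-in for a classifier with imperfect precision/recall."""
    flip = rng.random(labels.size) < error_rate
    return np.where(flip, 1 - labels, labels)

n_per_group, n_sims = 500, 200
p_a, p_b = 0.60, 0.50  # assumed true effect: group A forages more than group B

for error_rate in (0.0, 0.1, 0.2, 0.3):
    detections = 0
    for _ in range(n_sims):
        # Simulate true behaviour, then pass it through the noisy "classifier".
        a = corrupt((rng.random(n_per_group) < p_a).astype(int), error_rate, rng)
        b = corrupt((rng.random(n_per_group) < p_b).astype(int), error_rate, rng)
        # Test for a group difference in foraging proportion.
        table = [[a.sum(), n_per_group - a.sum()],
                 [b.sum(), n_per_group - b.sum()]]
        _, p_value, _, _ = stats.chi2_contingency(table)
        detections += p_value < 0.05
    print(f"classification error {error_rate:.0%}: "
          f"effect detected in {detections / n_sims:.0%} of simulations")
```

Symmetric label flipping attenuates the observed group difference towards zero, so detection probability falls as the error rate rises; the question such a framework answers is how quickly it falls for a given sample size and effect size.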