Automated Spectral Preprocessing via Bayesian Optimization for Chemometric Analysis of Milk Constituents

Bibliographic Details
Published in: Foods vol. 14, no. 17 (2025), p. 2996-3024
Main Author: Babatunde Habeeb Abolaji
Other Authors: McDougal, Owen M, Andersen, Timothy
Published: MDPI AG
Description
Abstract: Preprocessing infrared spectra can significantly improve predictive accuracy for protein, carbohydrate, lipid, and other nutritional components, yet selecting the optimal preprocessing is typically empirical, tedious, and dataset-specific. This study introduces a Bayesian optimization-based framework for the automated selection of spectral preprocessing pipelines within a chemometric modeling context. The framework was applied to mid-infrared spectra of milk to predict fat, protein, lactose, and total solids. A total of 385 averaged spectra corresponding to 198 unique samples were split 70/30 (training/test) using a group-aware Kennard-Stone algorithm, resulting in 269 averaged spectra (135 unique samples) for training and 116 spectra (58 unique samples) for testing. Six regression models, namely Elastic Net, Gradient Boosting Machines (GBM), Partial Least Squares (PLS), RidgeCV Regression, LassoLarsCV, and Support Vector Regression (SVR), were evaluated under three preprocessing conditions: (1) no preprocessing, (2) literature-derived custom preprocessing (e.g., MSC, SNV, and first and second derivatives), and (3) optimized preprocessing via the proposed Bayesian framework. Optimized preprocessing consistently outperformed the other two conditions, with RidgeCV achieving the best performance for all components except lactose, where PLS slightly outperformed it. Improvements in predictive accuracy, particularly in terms of RMSEP, were observed across all milk components. The best RMSEP results were achieved for protein (RMSEP = 0.054, R² = 0.981) and lactose (RMSEP = 0.026, R² = 0.917), followed by fat (RMSEP = 0.139, R² = 0.926) and total solids (RMSEP = 0.154, R² = 0.960). 
Literature-based pipelines demonstrated inconsistent effectiveness, highlighting the limitations of transferring preprocessing methods between datasets. The Bayesian optimization approach identified relatively simple yet highly effective preprocessing pipelines, typically involving few steps. By eliminating manual trial and error, this data-driven strategy offers a robust and generalizable solution that streamlines spectral modeling in dairy analysis and can be readily applied to other types of spectroscopic data across various domains.
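
The idea of scoring candidate preprocessing pipelines by prediction error, as the abstract describes, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes synthetic spectra, only two steps (SNV and a plain first difference standing in for a smoothed derivative), a one-component PLS model, and an exhaustive loop over five pipelines in place of a Bayesian optimizer, which would instead propose the next pipeline to try based on earlier scores. All names (`STEPS`, `apply_pipeline`, `fit_pls1`, `score_pipelines`, `make_spectrum`) are hypothetical.

```python
import itertools
import math
import random
import statistics


def snv(spectrum):
    """Standard normal variate: scale one spectrum by its own mean and std."""
    mu = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)
    return [(v - mu) / sd for v in spectrum]


def first_difference(spectrum):
    """Adjacent differences: a crude stand-in for a smoothed first derivative."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


# Hypothetical step registry; a real search space would also hold parameters.
STEPS = {"snv": snv, "diff1": first_difference}


def apply_pipeline(pipeline, spectra):
    """Apply the named steps, in order, to every spectrum."""
    for name in pipeline:
        spectra = [STEPS[name](s) for s in spectra]
    return spectra


def fit_pls1(X, y):
    """Fit a one-component PLS1 regression and return a predict(x) function."""
    n, p = len(X), len(X[0])
    x_mean = [statistics.fmean([row[j] for row in X]) for j in range(p)]
    y_mean = statistics.fmean(y)
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]  # X^T y
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]  # scores
    q = sum(a * b for a, b in zip(t, yc)) / sum(a * a for a in t)

    def predict(x):
        return y_mean + q * sum((x[j] - x_mean[j]) * w[j] for j in range(p))

    return predict


def rmse(y_true, y_pred):
    """Root mean square error (RMSEP when computed on held-out test data)."""
    return math.sqrt(statistics.fmean([(a - b) ** 2 for a, b in zip(y_true, y_pred)]))


def score_pipelines(fit_set, val_set):
    """Score every ordering of every subset of STEPS on a validation set."""
    (X_fit, y_fit), (X_val, y_val) = fit_set, val_set
    scores = {}
    for r in range(len(STEPS) + 1):
        for pipeline in itertools.permutations(STEPS, r):
            predict = fit_pls1(apply_pipeline(pipeline, X_fit), y_fit)
            preds = [predict(x) for x in apply_pipeline(pipeline, X_val)]
            scores[pipeline] = rmse(y_val, preds)
    return scores


def make_spectrum(conc, rng, n_points=40):
    """Synthetic spectrum: one peak scaled by conc, plus scatter and offset."""
    gain = rng.uniform(0.8, 1.2)     # multiplicative scatter (assumed)
    offset = rng.uniform(-0.2, 0.2)  # additive baseline shift (assumed)
    out = []
    for k in range(n_points):
        peak = math.exp(-((k - 20) / 4.0) ** 2)
        background = 0.5 + 0.01 * k
        out.append(gain * (background + conc * peak) + offset + rng.gauss(0, 0.005))
    return out


def make_set(n, rng):
    y = [rng.uniform(2.0, 5.0) for _ in range(n)]
    return [make_spectrum(c, rng) for c in y], y


rng = random.Random(0)
fit_set, val_set = make_set(60, rng), make_set(30, rng)
scores = score_pipelines(fit_set, val_set)
best = min(scores, key=scores.get)
for pipeline, err in sorted(scores.items(), key=lambda kv: kv[1]):
    print(" -> ".join(pipeline) or "raw", round(err, 4))
```

In this sketch the pipeline is chosen on a validation set rather than on the test spectra, so that a final RMSEP computed once on held-out data is not biased by the selection itself. Which pipeline wins here depends entirely on the synthetic data and says nothing about the milk spectra.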
ISSN: 2304-8158
DOI: 10.3390/foods14172996
Source: Agriculture Science Database