Automated Spectral Preprocessing via Bayesian Optimization for Chemometric Analysis of Milk Constituents

Bibliographic details
Published in: Foods vol. 14, no. 17 (2025), p. 2996-3024
Main author: Babatunde Habeeb Abolaji
Other authors: McDougal, Owen M; Andersen, Timothy
Published: MDPI AG
Description
Abstract: The preprocessing of infrared spectra can significantly improve predictive accuracy for protein, carbohydrate, lipid, or other nutritional components, yet the selection of an optimal preprocessing method is typically empirical, tedious, and dataset-specific. This study introduces a Bayesian optimization-based framework for the automated selection of spectral preprocessing pipelines within a chemometric modeling context. The framework was applied to mid-infrared spectra of milk to predict fat, protein, lactose, and total solids content. A total of 385 averaged spectra corresponding to 198 unique samples were split in a 70/30 ratio (training/test) using a group-aware Kennard-Stone algorithm, resulting in 269 averaged spectra (135 unique samples) for training and 116 spectra (58 unique samples) for testing. Six regression models (Elastic Net, Gradient Boosting Machines (GBM), Partial Least Squares (PLS), RidgeCV Regression, LassoLarsCV, and Support Vector Regression (SVR)) were evaluated under three preprocessing conditions: (1) no preprocessing, (2) literature-derived custom preprocessing (e.g., MSC, SNV, and first and second derivatives), and (3) optimized preprocessing via the proposed Bayesian framework. Optimized preprocessing consistently outperformed the other two conditions, with RidgeCV achieving the best performance for all components except lactose, where PLS slightly outperformed it. Improvements in predictive accuracy, particularly in terms of RMSEP, were observed across all milk components. The best RMSEP results were achieved for protein (RMSEP = 0.054, R² = 0.981) and lactose (RMSEP = 0.026, R² = 0.917), followed by fat (RMSEP = 0.139, R² = 0.926) and total solids (RMSEP = 0.154, R² = 0.960).
Literature-based pipelines demonstrated inconsistent effectiveness, highlighting the limitations of transferring preprocessing methods between datasets. The Bayesian optimization approach identified relatively simple yet highly effective preprocessing pipelines, typically involving few steps. By eliminating manual trial and error, this data-driven strategy offers a robust and generalizable solution that streamlines spectral modeling in dairy analysis and can be readily applied to other types of spectroscopic data across various domains.
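As a rough illustration of the building blocks the abstract names, the sketch below implements Standard Normal Variate (SNV) scaling, a first derivative, and the RMSEP metric in plain Python, and chains steps into a pipeline. It is not the authors' implementation. The function names (`snv`, `first_derivative`, `rmsep`, `apply_pipeline`) are hypothetical, SNV is assumed to use the sample standard deviation, and the derivative is a simple finite difference rather than a smoothed variant such as Savitzky-Golay. The Bayesian search over candidate pipelines is not shown.

```python
import math
import statistics


def snv(spectrum):
    """Standard Normal Variate: centre and scale one spectrum by its own
    mean and standard deviation (sample std, an assumption of this sketch)."""
    mean = statistics.mean(spectrum)
    std = statistics.stdev(spectrum)
    return [(x - mean) / std for x in spectrum]


def first_derivative(spectrum):
    """First-order finite difference between adjacent points.
    The output is one point shorter than the input."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


def rmsep(y_true, y_pred):
    """Root mean squared error of prediction over a test set."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def apply_pipeline(spectrum, steps):
    """Apply preprocessing steps in order; each step maps a list to a list."""
    for step in steps:
        spectrum = step(spectrum)
    return spectrum


if __name__ == "__main__":
    # Made-up absorbance values, for illustration only.
    raw = [0.10, 0.12, 0.18, 0.15]
    print(apply_pipeline(raw, [snv, first_derivative]))
```

In a framework like the one described, an optimizer would propose an ordered list of such steps (the `steps` argument here), fit a regression model on the transformed training spectra, and use a validation error such as RMSEP as the objective to minimize.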
ISSN:2304-8158
DOI:10.3390/foods14172996
Source: Agriculture Science Database