An LLM-guided platform for multi-granular collection and management of data provenance

Bibliographic Details
Published in: Journal of Big Data vol. 12, no. 1 (Jul 2025), p. 187
Main Author: Gregori, Luca
Other Authors: Lazzaro, Pasquale Leonardo; Lazzaro, Marialaura; Missier, Paolo; Torlone, Riccardo
Published:
Springer Nature B.V.
Subjects:
Online Access: Citation/Abstract
Full Text
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3233582374
003 UK-CbPIL
022 |a 2196-1115 
024 7 |a 10.1186/s40537-025-01209-3  |2 doi 
035 |a 3233582374 
045 2 |b d20250701  |b d20250731 
100 1 |a Gregori, Luca  |u Università Roma Tre, DICITA, Roma, Italy (GRID:grid.8509.4) (ISNI:0000 0001 2162 2106) 
245 1 |a An LLM-guided platform for multi-granular collection and management of data provenance 
260 |b Springer Nature B.V.  |c Jul 2025 
513 |a Journal Article 
520 3 |a As machine learning and AI systems become more prevalent, understanding how their decisions are made is key to maintaining their trust. To solve this problem, it is widely accepted that fundamental support can be provided by the knowledge of how data are altered in the pre-processing phase, using data provenance to track such changes. This paper focuses on the design and development of a system for collecting, managing, and querying data provenance of data preparation pipelines in data science. An investigation of publicly available machine learning pipelines is conducted to identify the most important features required for the tool to achieve impact on a broad selection of pre-processing data manipulation. Building on this study, we present an approach for transparently collecting data provenance based on the use of an LLM to: (i) automatically rewrite user-defined pipelines in a format suitable for this activity and (ii) store an accurate description of all the activities involved in the input pipelines for supporting the explanation of each of them. We then illustrate and test implementation choices aimed at supporting the provenance capture for data preparation pipelines efficiently in a transparent way for data scientists. 
653 |a Machine learning 
653 |a Data processing 
653 |a Datasets 
653 |a Artificial intelligence 
653 |a Pipelines 
653 |a Data science 
653 |a Feature selection 
653 |a Algorithms 
653 |a Data collection 
653 |a Bias 
653 |a Big Data 
653 |a Manipulation 
653 |a Decision making 
653 |a Data 
700 1 |a Lazzaro, Pasquale Leonardo  |u Università Roma Tre, DICITA, Roma, Italy (GRID:grid.8509.4) (ISNI:0000 0001 2162 2106) 
700 1 |a Lazzaro, Marialaura  |u Università Roma Tre, DICITA, Roma, Italy (GRID:grid.8509.4) (ISNI:0000 0001 2162 2106) 
700 1 |a Missier, Paolo  |u University of Birmingham, School of Computer Science, Birmingham, UK (GRID:grid.6572.6) (ISNI:0000 0004 1936 7486) 
700 1 |a Torlone, Riccardo  |u Università Roma Tre, DICITA, Roma, Italy (GRID:grid.8509.4) (ISNI:0000 0001 2162 2106) 
773 0 |t Journal of Big Data  |g vol. 12, no. 1 (Jul 2025), p. 187 
786 0 |d ProQuest  |t ABI/INFORM Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3233582374/abstract/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full Text  |u https://www.proquest.com/docview/3233582374/fulltext/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3233582374/fulltextPDF/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch