Proc. of the 33rd Italian Symp. on Advanced Database Systems (SEBD). CEUR Workshop Proceedings, https://ceur-ws.org/. 2025.
This paper presents the vision of the S-PIC4CHU project, which aims to develop innovative models and techniques for scalable data preparation in Data Science and Machine Learning. The project focuses on leveraging data semantics throughout all data preparation stages to improve data quality and ensure unbiased results. The proposed approach involves a novel data preparation pipeline semantically enriched with domain knowledge from ontologies and knowledge graphs, along with novel, semantic-based techniques for data cleaning, integration, provenance, explanation, and quality management. Validation of the approach relies on use cases from different domains, and the project aims to release open-source tools.
@inproceedings{SEBD-2025-spic4chu,
  title = "S-PIC4CHU: Semantics-based Provenance, Integrity, and Curation for Consistent, High-quality, and Unbiased Data Science",
  year = "2025",
  author = "Gianvincenzo Alfano and Ilaria Bartolini and Diego Calvanese and Paolo Ciaccia and Sergio Greco and Davide Lanti and Emilia Lenzi and Davide Martinenghi and Cristian Molinaro and Marco Patella and Letizia Tanca and Riccardo Torlone and Irina Trubitsyna",
  booktitle = "Proc. of the 33rd Italian Symp. on Advanced Database Systems (SEBD)",
  publisher = "CEUR-WS.org",
  series = "CEUR Workshop Proceedings, https://ceur-ws.org/",
}