Hudovernik, V., Jurkovič, M., & Strumbelj, E. (2024). Benchmarking the Fidelity and Utility of Synthetic Relational Data. arXiv preprint arXiv:2410.03411v1.
This research paper aims to address the lack of comprehensive benchmarking studies evaluating the fidelity and utility of synthetic relational data generated by current state-of-the-art methods.
The authors developed a benchmarking tool that combines statistical, distance-based, and detection-based fidelity metrics, including a new Discriminative Detection with Aggregation (DDA) method. They evaluated six methods (SDV, RC-TGAN, REaLTabFormer, ClavaDDPM, MostlyAI, and GretelAI) on six datasets (AirBnB, Rossmann, Walmart, Biodegradability, MovieLens, and Cora) with varying relational complexity. Utility was assessed through train-on-synthetic, evaluate-on-real experiments covering predictive modeling and feature importance ranking.
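To make the idea of detection with aggregation concrete, the following is a minimal sketch, not the authors' implementation. It assumes one parent table and one child table stored as lists of dictionaries, and it swaps in a leave-one-out 1-nearest-neighbour classifier as the discriminator purely to stay dependency-free. The column names (`id`, `parent_id`, `x`, `value`), the helper names, and the toy data are all hypothetical. Accuracy close to 0.5 means the discriminator cannot tell real rows from synthetic ones; accuracy close to 1.0 means the synthetic data is easy to detect.

```python
from collections import defaultdict
from statistics import mean


def aggregate_children(parents, children, key="id", fk="parent_id",
                       parent_col="x", child_col="value"):
    """Build one feature row per parent: own column, child count, child mean."""
    groups = defaultdict(list)
    for child in children:
        groups[child[fk]].append(float(child[child_col]))
    features = []
    for parent in parents:
        values = groups.get(parent[key], [])
        features.append([
            float(parent[parent_col]),        # the parent's own column
            float(len(values)),               # number of child rows
            mean(values) if values else 0.0,  # mean of the child column
        ])
    return features


def detection_accuracy(real_features, synthetic_features):
    """Leave-one-out 1-NN accuracy at separating real (1) from synthetic (0)."""
    rows = [(f, 1) for f in real_features] + [(f, 0) for f in synthetic_features]
    correct = 0
    for i, (feat, label) in enumerate(rows):
        # Nearest other row by squared Euclidean distance (ties: first found).
        nearest = min(
            (r for j, r in enumerate(rows) if j != i),
            key=lambda r: sum((a - b) ** 2 for a, b in zip(feat, r[0])),
        )
        correct += nearest[1] == label
    return correct / len(rows)


# Toy data (hypothetical): the parent columns match exactly, but every real
# parent has three child rows while every synthetic parent has only one.
real_parents = [{"id": i, "x": i % 3} for i in range(6)]
synth_parents = [{"id": i, "x": i % 3} for i in range(6)]
real_children = [{"parent_id": i, "value": 10.0} for i in range(6) for _ in range(3)]
synth_children = [{"parent_id": i, "value": 10.0} for i in range(6)]

# Discriminator that sees only the parent table.
parent_only = detection_accuracy(
    [[float(p["x"])] for p in real_parents],
    [[float(p["x"])] for p in synth_parents],
)

# Discriminator that also sees aggregated child-table features.
with_agg = detection_accuracy(
    aggregate_children(real_parents, real_children),
    aggregate_children(synth_parents, synth_children),
)

print(parent_only, with_agg)
```

In this toy setup the parent-only discriminator has nothing to separate the two sources on, while the aggregated child count exposes the broken relationship. That is the kind of multi-table discrepancy a single-table detection metric would miss.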
While promising, current synthetic relational data generation methods need further development to improve their ability to capture the complexities of real-world relational datasets, particularly in preserving multi-table relationships. The proposed benchmarking tool and DDA method provide a valuable resource for evaluating and guiding future research in this field.
This study highlights the limitations of existing synthetic relational data generation methods and emphasizes the need for improved techniques to ensure data fidelity and utility for privacy-sensitive applications and data sharing initiatives.
The study was limited to a specific set of methods, datasets, and metrics. Future research should explore a wider range of techniques, datasets with diverse characteristics, and privacy-preserving aspects of synthetic data generation.