We’re happy to announce that the paper "Debiasing Synthetic Data Generated by Deep Generative Models", written as part of the SYNDARA project, has been accepted to NeurIPS 2024! This collaborative work by Dr. Alexander Decruyenaere and Dr. Heidelinde Dehaene, both HINT.GENT members, and Prof. Stijn Vansteelandt also earned the Best Poster Award at the 2024 Annual Meeting of the Royal Statistical Society of Belgium.
As the need for data sharing intensifies, especially in health research, so too does the challenge of safeguarding privacy. Synthetic data, which replicate the statistical properties of sensitive datasets without revealing individual records, have emerged as a promising solution. However, these data are not without flaws. When they are generated using deep generative models (DGMs), significant biases and inaccuracies can compromise their reliability for statistical analyses.
Key Challenges in Synthetic Data Analysis
In our prior work (spotlighted at UAI 2024), we demonstrated how DGMs can introduce substantial bias and imprecision in synthetic data analyses, leading to inflated type 1 error rates, that is, more false positives than the nominal significance level allows. This makes inference from synthetic data less reliable than inference from the original data. Existing methods that account for uncertainty in synthetic data often fall short, as they neglect the regularization bias introduced by DGMs.
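To make the link between bias and false positives concrete, here is a minimal toy simulation. It is not the method or the experimental setup from either paper: the constant shift `bias` is only a hypothetical stand-in for bias introduced by a generative model, and the function name, sample size, and number of simulations are assumptions chosen for illustration. The true population mean is 0, so every rejection of the null hypothesis is a false positive.

```python
import numpy as np


def rejection_rate(bias, n=500, n_sim=2000, seed=0):
    """Share of simulated datasets in which a two-sided z-test at the 5%
    level rejects the null hypothesis that the population mean is 0."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated dataset. The true mean is 0; `bias` shifts
    # the data as a toy stand-in for generator-induced bias (an assumption
    # for illustration, not a model of a real DGM).
    samples = rng.normal(loc=bias, scale=1.0, size=(n_sim, n))
    means = samples.mean(axis=1)
    # Naive standard error that ignores any bias in the data.
    std_errors = samples.std(axis=1, ddof=1) / np.sqrt(n)
    z = means / std_errors
    return float(np.mean(np.abs(z) > 1.96))


unbiased_rate = rejection_rate(bias=0.0)
biased_rate = rejection_rate(bias=0.1)
print(f"rejection rate without bias: {unbiased_rate:.3f}")
print(f"rejection rate with bias 0.1: {biased_rate:.3f}")
```

Without bias, the rejection rate should sit near the nominal 5%. With even a small shift relative to the standard deviation, the test rejects far more often, because the naive standard error shrinks with the sample size while the bias does not.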
A Novel Debiasing Strategy
To address these challenges, we developed an innovative strategy that specifically targets biases in synthetic data generated by DGMs. This approach aims to restore the accuracy of statistical analyses, even for seemingly straightforward parameters like population means.
Our full findings are detailed in the paper, now available on arXiv.
About the SYNDARA Project
This work is part of the SYNDARA project (SYNthetic DAta for Research Acceleration), a collaboration between Ghent University Hospital and Ghent University. The project unites expertise from the Data Analysis and Statistical Science team and IDLab, supported by a dedicated research team (SYNDARA Team).
By addressing the limitations of current synthetic data methods, this research paves the way for more robust and privacy-preserving solutions in data-driven research. Stay tuned for more updates from SYNDARA!
