Understanding and Mitigating Memorization in Diffusion Models for Tabular Data

Bibliographic Details
Published in:	arXiv.org (Dec 15, 2024), p. n/a
Main Author:	Fang, Zhengyu
Other Authors:	Jiang, Zhimeng; Chen, Huiyuan; Li, Xiao; Li, Jing
Published:	Cornell University Library, arXiv.org
Subjects:	Datasets; Data augmentation; Tables (data); Empirical analysis; Image quality; Clustering; Synthetic data
Online Access:	Citation/Abstract; Full text outside of ProQuest

MARC


LEADER	00000nab a2200000uu 4500
001	3145907151
003	UK-CbPIL
022			\|a 2331-8422
035			\|a 3145907151
045	0		\|b d20241215
100	1		\|a Fang, Zhengyu
245	1		\|a Understanding and Mitigating Memorization in Diffusion Models for Tabular Data
260			\|b Cornell University Library, arXiv.org \|c Dec 15, 2024
513			\|a Working Paper
520	3		\|a Tabular data generation has attracted significant research interest in recent years, with tabular diffusion models greatly improving the quality of synthetic data. However, while memorization, where models inadvertently replicate exact or near-identical training data, has been thoroughly investigated in image and text generation, its effects on tabular data remain largely unexplored. In this paper, we conduct the first comprehensive investigation of memorization phenomena in diffusion models for tabular data. Our empirical analysis reveals that memorization appears in tabular diffusion models and increases with the number of training epochs. We further examine the influence of factors such as dataset size, feature dimension, and the choice of diffusion model on memorization. Additionally, we provide a theoretical explanation for why memorization occurs in tabular diffusion models. To address this issue, we propose TabCutMix, a simple yet effective data augmentation technique that exchanges randomly selected feature segments between random same-class training sample pairs. Building upon this, we introduce TabCutMixPlus, an enhanced method that clusters features based on feature correlations and ensures that features within the same cluster are exchanged together during augmentation. This clustering mechanism mitigates out-of-distribution (OOD) generation issues by maintaining feature coherence. Experimental results across various datasets and diffusion models demonstrate that TabCutMix effectively mitigates memorization while maintaining high-quality data generation.
653			\|a Datasets
653			\|a Data augmentation
653			\|a Tables (data)
653			\|a Empirical analysis
653			\|a Image quality
653			\|a Clustering
653			\|a Synthetic data
700	1		\|a Jiang, Zhimeng
700	1		\|a Chen, Huiyuan
700	1		\|a Li, Xiao
700	1		\|a Li, Jing
773	0		\|t arXiv.org \|g (Dec 15, 2024), p. n/a
786	0		\|d ProQuest \|t Engineering Database
856	4	1	\|3 Citation/Abstract \|u https://www.proquest.com/docview/3145907151/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch
856	4	0	\|3 Full text outside of ProQuest \|u http://arxiv.org/abs/2412.11044
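
Illustrative sketch (not part of the catalog record): the abstract describes TabCutMix as exchanging randomly selected feature segments between random same-class training sample pairs, and TabCutMixPlus as exchanging correlated feature clusters together. The Python below is one possible reading of that description, not the authors' implementation. The function name `tab_cut_mix`, its parameters, the uniformly drawn mixing fraction `lam`, and the hand-supplied `groups` argument (standing in for correlation-based clusters) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def tab_cut_mix(rows, labels, n_new, groups=None, seed=0):
    """Return n_new (row, label) pairs mixed from same-class training pairs.

    rows   : list of equal-length feature lists
    labels : class label for each row
    groups : optional list of feature-index lists that are swapped as a unit
             (hypothetical stand-in for correlation-based feature clusters);
             defaults to one group per feature.
    """
    rng = random.Random(seed)
    n_features = len(rows[0])
    if groups is None:
        groups = [[j] for j in range(n_features)]

    # Index rows by class so that both members of a pair share a label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    # Only rows whose class has at least two members can be paired.
    candidates = [i for i, y in enumerate(labels) if len(by_class[y]) >= 2]
    if not candidates:
        raise ValueError("need at least one class with two or more rows")

    out = []
    for _ in range(n_new):
        a = rng.choice(candidates)
        b = rng.choice([i for i in by_class[labels[a]] if i != a])
        lam = rng.random()  # assumed: expected fraction of groups taken from row a
        mixed = list(rows[b])
        for group in groups:
            if rng.random() < lam:
                for j in group:
                    mixed[j] = rows[a][j]
        out.append((mixed, labels[a]))
    return out
```

With `groups=None` every feature is swapped independently, which corresponds to the plain feature exchange in the abstract. Passing groups such as `[[0, 1], [2, 3]]` keeps each listed set of columns together, which is the coherence property the abstract attributes to TabCutMixPlus. How the clusters are derived from feature correlations is not specified in this record and is left out of the sketch.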