Abstract
To address privacy leakage when mining high-dimensional data, and the low prediction accuracy that results from adding differential privacy noise directly, a differentially private LightGBM algorithm based on integrated (ensemble) feature selection is proposed, balancing data privacy protection against data utility. The results of five base feature selection methods (Pearson correlation coefficient, random forest, L1 regularization, mutual information, and LightGBM) are combined by vote accumulation. A personalized privacy budget is allocated across the top 50 features chosen by the integrated selection, differential privacy noise is added using the Laplace mechanism, and a LightGBM model is then trained. Experimental results show that the proposed algorithm satisfies differential privacy and can prevent leakage of private information, and that, compared with adding noise based on a single feature selection result before prediction, it improves accuracy by 10.86% and F1-score by 11.08%.
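The pipeline summarized in the abstract (vote accumulation, per-feature budget allocation, Laplace noise) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration only: the abstract does not state the allocation rule, so the sketch splits the total budget in proportion to vote counts; the function names, the toy rankings, and the sensitivity value are hypothetical; and the LightGBM training step is omitted. It is not the authors' implementation.

```python
import random
from collections import Counter


def vote_accumulate(rankings, top_k):
    """Each selector votes for its own top_k features; votes are summed."""
    votes = Counter()
    for ranking in rankings:
        votes.update(ranking[:top_k])
    # Sort by vote count (descending); ties are broken by feature name.
    ordered = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:top_k]


def allocate_budget(voted, total_epsilon):
    """ASSUMPTION: split the total budget in proportion to vote counts,
    so features with more votes receive a larger epsilon (less noise)."""
    total_votes = sum(count for _, count in voted)
    return {name: total_epsilon * count / total_votes for name, count in voted}


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: difference of two i.i.d. exponentials
    with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_column(values, epsilon, sensitivity, rng):
    """Laplace mechanism: add noise with scale = sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


# Hypothetical rankings from three selectors (best feature first).
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["b", "c", "a", "d"],
]
voted = vote_accumulate(rankings, top_k=2)          # [('b', 3), ('a', 2)]
budgets = allocate_budget(voted, total_epsilon=1.0)

rng = random.Random(0)
column_b = [0.1, 0.5, 0.9]  # toy values, assumed scaled to [0, 1]
noisy_b = perturb_column(column_b, budgets["b"], sensitivity=1.0, rng=rng)
```

In the setting the abstract describes, the rankings would come from the five base selectors and `top_k` would be 50; the perturbed columns would then be passed to LightGBM for training.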
Authors
靳珂
荣存庆
常锦才
JIN Ke; RONG Cun-qing; CHANG Jin-cai (College of Science, North China University of Science and Technology, Tangshan, Hebei 063210, China)
Source
《华北理工大学学报(自然科学版)》
CAS
2024, No. 2, pp. 145-155 (11 pages)
Journal of North China University of Science and Technology: Natural Science Edition
Funding
National Natural Science Foundation of China (61702184).
Keywords
integrated feature selection
personalized differential privacy
privacy protection
machine learning