Abstract
Distributed data mining (DDM) methods that rely on centralized processing have well-known shortcomings. To overcome them and carry out DDM tasks effectively, a technique is needed that can draw a diverse, representative sample from distributed data sources. This paper proposes OptiSim-DDM, a new data sampling algorithm for distributed data mining environments. Its core is data selection based on optimizable K-dissimilarity: it combines mobile agent technology with an extended optimizable K-dissimilarity method for selecting diverse, representative subsets, and it selects a diverse, representative sample of the global dataset by visiting the distributed data sites in turn. By reducing the size of the dataset to be mined, the method lowers the time and space complexity of the mining algorithms, reduces network communication cost, and improves the efficiency of data mining. It is suited to distributed data mining tasks in which the data at the individual sites are interrelated and interdependent. Experimental results confirm that the method is feasible and effective.
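The selection loop summarized in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not give the algorithm's details, so the round-robin visiting order, the Euclidean distance, and the names `optisim_round_robin`, `m`, `k`, and `radius` are all hypothetical. The "mobile agent" is modeled simply as a loop that carries the selected set from one site's list to the next.

```python
import random


def distance(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def min_dist(point, selected):
    """Distance from `point` to its nearest already-selected point."""
    return min(distance(point, s) for s in selected)


def optisim_round_robin(sites, m, k, radius, seed=0):
    """Select up to `m` mutually dissimilar points from several sites.

    Assumed scheme: on each visit the agent draws up to `k` local candidates
    that are at least `radius` away from everything selected so far, keeps
    the most dissimilar one, and moves on to the next site.
    """
    rng = random.Random(seed)
    # Each site keeps its own shuffled pool; only selected points travel.
    pools = [rng.sample(site, len(site)) for site in sites]
    selected = []
    site_idx = 0
    idle_visits = 0  # consecutive visits that produced no selection
    while len(selected) < m and idle_visits < len(pools):
        pool = pools[site_idx]
        subsample = []
        while pool and len(subsample) < k:
            cand = pool.pop()
            # Candidates too close to the selected set are dropped for good.
            if not selected or min_dist(cand, selected) >= radius:
                subsample.append(cand)
        if subsample:
            if selected:
                best = max(subsample, key=lambda p: min_dist(p, selected))
            else:
                best = subsample[0]
            subsample.remove(best)
            selected.append(best)
            pool[:0] = subsample  # unchosen candidates return to the pool
            idle_visits = 0
        else:
            idle_visits += 1  # this site's pool is exhausted
        site_idx = (site_idx + 1) % len(pools)
    return selected


sites = [
    [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)],
    [(10.0, 0.0), (10.1, 0.1)],
    [(0.0, 10.0), (5.0, 5.1)],
]
sample = optisim_round_robin(sites, m=4, k=2, radius=1.0)
```

In this toy run the three lists stand in for three data sites. Every pair of points in `sample` is at least `radius` apart, so near-duplicate records held at different sites (such as `(5.0, 5.0)` and `(5.0, 5.1)`) are represented only once.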
Source
Journal of Southeast University (Natural Science Edition)
EI
CAS
CSCD
Peking University Core Journals
2008, No. 3, pp. 385-389 (5 pages)
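Funding
National Natural Science Foundation of China (Grant No. 70371015)
Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education of China (Grant No. 20040286009)
Science and Technology Project of the Fujian Provincial Department of Education (Grant No. JB06142)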