Sampling method using optimizable K-dissimilarity for distributed data mining (分布式数据挖掘中的最优K相异性取样技术)

Cited by: 5

Abstract: To overcome the shortcomings of distributed data mining methods that rely on centralized processing, and to carry out distributed data mining (DDM) tasks effectively, a technique is needed that can obtain a diverse, representative sample from distributed data sources. This paper proposes a new data sampling algorithm for distributed data mining environments, OptiSim-DDM. Its core is data selection based on optimizable K-dissimilarity: it combines mobile agent technology with an extended optimizable K-dissimilarity method for selecting diverse representative subsets, and it builds a diverse, representative sample of the global dataset by visiting the distributed data sites in turn. By reducing the size of the dataset to be mined, the method lowers the time and space complexity of the data mining algorithm, reduces network communication cost, and improves execution efficiency. It is suited to distributed data mining tasks in which the data at the different sites are interrelated and interdependent. Experimental results confirm that the method is feasible and effective.
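The selection idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' OptiSim-DDM implementation: the abstract gives no algorithmic details, so the function name `select_round_robin`, the Euclidean distance, the exclusion radius `radius`, the per-site recycling of unchosen candidates, and the simple round-robin visiting order are all assumptions made for illustration. The loop loosely follows the general OptiSim scheme: draw up to `k` candidates that are at least `radius` away from everything already chosen, keep the one farthest from the chosen set, and set the rest aside for later rounds.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Plain Euclidean distance (an assumed dissimilarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def select_round_robin(
    sites: Sequence[Sequence[Point]],
    k: int,
    radius: float,
    n_select: int,
    distance: Callable[[Point, Point], float] = euclidean,
    seed: int = 0,
) -> List[Point]:
    """Hypothetical OptiSim-style selection that visits sites in turn."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    pools = [list(site) for site in sites]  # local candidate pool per site
    for pool in pools:
        rng.shuffle(pool)
    recycle: List[List[Point]] = [[] for _ in pools]  # unchosen candidates
    selected: List[Point] = []  # the sample carried from site to site
    site = 0
    while len(selected) < n_select and any(p or r for p, r in zip(pools, recycle)):
        pool = pools[site]
        if not pool and recycle[site]:
            # Local pool exhausted: give recycled candidates another chance.
            pool.extend(recycle[site])
            recycle[site].clear()
            rng.shuffle(pool)
        subsample: List[Tuple[float, Point]] = []
        while pool and len(subsample) < k:
            cand = pool.pop()
            # Distance to the nearest already-selected point (inf if none yet).
            d = min((distance(cand, s) for s in selected), default=float("inf"))
            if d >= radius:
                subsample.append((d, cand))
            # Otherwise the candidate is too similar and is dropped for good.
        if subsample:
            # Keep the candidate farthest from the current selection.
            best = max(range(len(subsample)), key=lambda i: subsample[i][0])
            selected.append(subsample[best][1])
            recycle[site].extend(
                c for i, (_, c) in enumerate(subsample) if i != best
            )
        site = (site + 1) % len(pools)  # move on to the next site
    return selected
```

In this sketch a small `k` behaves more like random sampling subject to the exclusion radius (favoring representativeness), while a large `k` approaches maximum-dissimilarity selection (favoring diversity). Only the selected points travel between sites, which is the property the abstract credits with reducing communication cost.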

Authors: Hu Wenyu (胡文瑜), Sun Zhihui (孙志挥), Zhang Baili (张柏礼)

Affiliation: School of Computer Science and Engineering, Southeast University (东南大学计算机科学与工程学院)

Source: Journal of Southeast University (Natural Science Edition) (《东南大学学报（自然科学版）》), 2008, No. 3, pp. 385-389 (5 pages). Indexed in: EI, CAS, CSCD, Peking University Core Journals (北大核心).

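Funding: National Natural Science Foundation of China (70371015); Ministry of Education Research Fund for the Doctoral Program of Higher Education (20040286009); Science and Technology Project of the Fujian Provincial Department of Education (JB06142)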

Keywords: distributed data mining (DDM); optimizable K-dissimilarity selection method; Agent

Classification number: TP311.13 [Automation and Computer Technology: Computer Software and Theory]

Citation network

References (11)

1. Park B, Kargupta H. Distributed data mining: algorithms, systems, and applications [M]. Hillsdale, NJ: Lawrence Erlbaum, 2003: 341-361.
2. Zaki M J, Pan Y. Introduction: recent developments in parallel and distributed data mining [J]. Distributed and Parallel Databases, 2002, 11(2): 123-127.
3. Ashrafi M Z, Taniar D, Smith K A. A data mining architecture for distributed environments [C]// Innovative Internet Computing Systems, Lecture Notes in Computer Science, vol. 2346. Berlin, Germany: Springer-Verlag, 2002: 27-38.
4. Kargupta H, Park B. Collective data mining: a new perspective toward distributed data mining [C]// Advances in Distributed and Parallel Knowledge Discovery. Menlo Park, CA, USA: AAAI/MIT Press, 2000: 131-178.
5. Cabri G, Leonardi L, Zambonelli F. Mobile agent technology: current trends and perspectives [EB/OL]. (2002-11-10) [2007-05-02]. http://polaris.ing.unimo.it/MOON/papers/aica98.pdf.
6. Clark R D. OptiSim: an extended dissimilarity selection method for finding diverse representative subsets [J]. Journal of Chemical Information and Computer Sciences, 1997, 37(6): 1181-1188.
7. Clark R D, Langton W J. Balancing representativeness against diversity using optimizable K-dissimilarity and hierarchical clustering [J]. Journal of Chemical Information and Computer Sciences, 1998, 38(6): 1079-1086.
8. Soltanshahi F, Akella L, Clark R D. OptDesign: extending optimizable K-dissimilarity selection for use in combinatorial library design [J]. Journal of Chemical Information and Computer Sciences, 2003, 43(3): 829-836.
9. Hu Wenyu, Sun Zhihui, Zhou Xiaoyun. Research on a density clustering algorithm based on dissimilarity selection [J]. Journal of Chinese Computer Systems (小型微型计算机系统), 2006, 27(9): 1601-1604. Cited by: 2
10. Zhong N, Matsui Y, Okuno T, et al. Framework of a multi-agent KDD system [C]// Proc of Intelligent Data Engineering and Automated Learning (IDEAL), Third International Conference. Manchester, UK: Springer-Verlag, 2002: 337-346.

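Citing literature (5)

1. Zheng Liping. Similarity-based distributed data mining [J]. Journal of Zhangzhou Normal University (Natural Science Edition), 2010, 23(3): 36-39. Cited by: 1
2. Wei Xiaoyan, He Zhanhui, Yi Bo, Xu Liang, Li Xuewei, Li Lanzhou, Li Folin. Study on spatial similarity algorithms for tobacco-growing meteorology, soil, and tobacco leaf quality in Lijiang [J]. Journal of Yunnan Agricultural University (Natural Science Edition), 2011, 26(B12): 139-142.
3. Zhang Chengshu. Some analyses of sampling approaches in data mining [J]. Journal of Chifeng University (Natural Science Edition), 2014, 30(9): 10-11. Cited by: 3
4. Wu Jingna, Yang Shu, Wang Jianhui. A fast online learning algorithm for distributed big data mining [J]. Journal of Shenyang Normal University (Natural Science Edition), 2016, 34(1): 100-104. Cited by: 3
5. Zhao Weijie, Hu Jiangmin, Wen Xiaoqin. On data mining methods under the global communication network model [J]. China New Telecommunications (中国新通信), 2019, 21(21): 116-117.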
