Parallel text hierarchical clustering based on MapReduce (基于MapReduce的文本层次聚类并行化)

Cited by: 5

Abstract: To address the limited scalability of traditional hierarchical clustering algorithms on large-scale text, a parallel text hierarchical clustering algorithm based on the MapReduce programming model is proposed. A vertical data partitioning algorithm, based on statistics of the component groups of the text vectors, is applied to data distribution in MapReduce, and the sorting property of MapReduce is applied to the selection of merge points. Together these make the algorithm more efficient and help improve clustering accuracy. Experimental results show that the proposed algorithm is effective for large-scale text clustering and scales well.
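The merge-point selection mentioned in the abstract can be illustrated with a minimal single-process sketch. The abstract does not give the algorithm's details, so everything here is an assumption made for illustration: the function names are hypothetical, centroid linkage with Euclidean distance is used as the merge criterion, `sorted` stands in for the framework's shuffle-and-sort phase, and the vertical data partitioning step is not modeled. The sketch only shows the general idea that, if the distance is emitted as the key, sorting by key delivers the closest pair first.

```python
from itertools import combinations
import math


def map_pairs(clusters):
    """Map step (hypothetical): emit (distance, (id_a, id_b)) per centroid pair."""
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(clusters.items()), 2):
        yield math.dist(vec_a, vec_b), (id_a, id_b)


def shuffle_sort(records):
    """Stand-in for the MapReduce shuffle phase, which sorts records by key."""
    return sorted(records)


def reduce_select_merge(sorted_records):
    """Reduce step (hypothetical): the first sorted record is the closest pair."""
    return sorted_records[0]


def agglomerate(points, target_clusters):
    """Merge the closest pair of clusters until target_clusters remain."""
    clusters = {i: tuple(p) for i, p in enumerate(points)}  # id -> centroid
    members = {i: [i] for i in clusters}                    # id -> point indices
    while len(clusters) > target_clusters:
        _, (a, b) = reduce_select_merge(shuffle_sort(map_pairs(clusters)))
        n_a, n_b = len(members[a]), len(members[b])
        # Assumed centroid linkage: size-weighted mean of the two centroids.
        clusters[a] = tuple(
            (x * n_a + y * n_b) / (n_a + n_b)
            for x, y in zip(clusters[a], clusters[b])
        )
        members[a] += members.pop(b)
        del clusters[b]
    return sorted(sorted(m) for m in members.values())


if __name__ == "__main__":
    print(agglomerate([(0, 0), (0, 1), (10, 10), (10, 11)], 2))
```

In a real MapReduce job the pair generation would be spread across mappers and the framework would perform the sort; this sketch recomputes all pairwise distances in every round, which is simple but not efficient.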

Authors: 余晓山 (Yu Xiaoshan), 吴扬扬 (Wu Yangyang)

Affiliation: College of Computer Science and Technology, Huaqiao University (华侨大学计算机科学与技术学院)

Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2014, No. 6, pp. 1595-1599, 1680 (6 pages)

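Funding: Major Project of the Fujian Provincial Science and Technology Plan (2011H6016); Key Project of the Fujian Provincial Science and Technology Plan (2011H0028)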

Keywords: text clustering; hierarchical clustering; data partitioning; MapReduce; parallel computing

Classification number: TP391.1 [Automation and Computer Technology, Computer Application Technology]
