
Parallel text hierarchical clustering based on MapReduce (cited by: 5)
Abstract: To address the limited scalability of traditional hierarchical clustering algorithms on large-scale text collections, a parallel text hierarchical clustering algorithm based on the MapReduce programming model is proposed. A vertical data partitioning algorithm, based on statistics of the component groups of text vectors, is applied to data distribution in MapReduce, and the sorting property of MapReduce is used to select merge points. Together these make the algorithm more efficient and help improve clustering accuracy. Experimental results show that the algorithm is effective for large-scale text clustering and scales well.
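Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2014, Issue 6, pp. 1595-1599, 1680 (6 pages)
Funding: Major Project of the Fujian Provincial Science and Technology Plan (2011H6016); Key Project of the Fujian Provincial Science and Technology Plan (2011H0028)
Keywords: text clustering; hierarchical clustering; data partitioning; MapReduce; parallel computing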
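
The abstract names two ideas without giving details: distributing data by vertical partitions of the text vectors, and using MapReduce's sort to pick the merge point. The following is only a minimal single-process sketch of how such a step might look, not the authors' algorithm. All function names are hypothetical, the contiguous column split stands in for the paper's statistics-based partitioning (which is not described here), and squared Euclidean distance is an assumed similarity measure.

```python
from collections import defaultdict
from itertools import combinations


def vertical_partitions(vectors, num_groups):
    """Split every vector into contiguous component groups.

    Assumption: a plain column split replaces the paper's
    statistics-based vertical partitioning, whose details are not given.
    """
    dim = len(next(iter(vectors.values())))
    size = -(-dim // num_groups)  # ceiling division
    for start in range(0, dim, size):
        yield {doc: vec[start:start + size] for doc, vec in vectors.items()}


def map_partial_distances(partition):
    """Map: emit ((i, j), partial squared distance) for one component group."""
    for i, j in combinations(sorted(partition), 2):
        partial = sum((a - b) ** 2 for a, b in zip(partition[i], partition[j]))
        yield (i, j), partial


def reduce_sum(emitted):
    """Reduce: add up each pair's partial distances across component groups."""
    totals = defaultdict(float)
    for pair, partial in emitted:
        totals[pair] += partial
    return totals


def select_merge_pair(totals):
    """Key records by distance; sorting (as the shuffle phase would)
    places the closest pair first, and that pair is the merge point."""
    return sorted((dist, pair) for pair, dist in totals.items())[0]


def next_merge(vectors, num_groups=2):
    """Run one simulated round and return (distance, (doc_i, doc_j))."""
    emitted = []
    for partition in vertical_partitions(vectors, num_groups):
        emitted.extend(map_partial_distances(partition))
    return select_merge_pair(reduce_sum(emitted))


if __name__ == "__main__":
    docs = {"a": [0, 0, 0, 0], "b": [1, 0, 0, 1], "c": [5, 5, 5, 5]}
    print(next_merge(docs))
```

Because squared Euclidean distance is a sum over components, each component group can be processed independently and the partial results added afterwards, which is what makes a vertical split workable in this sketch. In a real MapReduce job the final `sorted` call would be replaced by the framework's own key ordering during the shuffle.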

