摘要
针对传统的层次聚类算法在处理大规模文本时可扩展性不足的问题,提出基于MapReduce编程模型的并行化文本层次聚类算法。将基于文本向量分量组特征统计的垂直数据划分算法应用于MapReduce的数据分发,将MapReduce的排序特性应用于合并点的选择,使得算法更加高效,同时有利于提高聚类精度。实验结果表明了利用该算法进行大规模文本聚类的有效性及良好的可扩展性。
Conceming the deficiency in scalability of the traditional hierarchical clustering algorithm when dealing with large-scale text, a parallel hierarchical clustering algorithm based on the MapReduce programming model was proposed. The vertical data partitioning algorithm based on the statistical characteristic of the components group of text vector was developed for data partitioning in MapReduee. Additionally, the sorting characteristics of the MapReduce were applied to select the merge points, making the algorithm be more efficient and conducive to improve clustering accuracy. The experimental results show that the proposed algorithm is effective and has good scalability.
出处
《计算机应用》
CSCD
北大核心
2014年第6期1595-1599,1680,共6页
journal of Computer Applications
基金
福建省科技计划重大项目(2011H6016)
福建省科技计划重点项目(2011H0028)