期刊文献+

面向大文本数据集的间接谱聚类 被引量:3

Indirect spectral clustering towards large text datasets
在线阅读 下载PDF
导出
摘要 针对谱聚类存在计算瓶颈的问题,提出了一种快速的集成算法,称为间接谱聚类。它首先运用K-Means算法对数据集进行过分聚类,然后把每个过分簇看成一个基本对象,最后在过分簇的级别上利用标准谱聚类来完成总体的聚类。将该思想应用于大文本数据集的聚类问题后,过分簇中心之间的相似性度度量方法可以采用常用的余弦距离法。在20-Newgroups文本数据上的实验结果表明:间接谱聚类算法在聚类准确性上比K-Means算法平均高出14.72%;比规范割谱聚类仅低0.88%,但算法所需的计算时间平均不到规范割谱聚类的1/16,且随着数据集的增大当规范割谱聚类遭遇计算瓶颈时,提出的算法却能快速地给出次优解。 To alleviate the computational bottleneck of spectral clustering,in this paper a general ensemble algorithm,called indirect spectral clustering,was developed.The algorithm first grouped a given large dataset into many over-clusters and then regarded each obtained over-cluster as a basic object.And then the standard spectral clustering ran at this object level.By convention,when applying this new idea to large text datasets,the cosine distance would be the appropriate manner in measuring the similarities between over-clusters.The empirical studies on 20-Newgroups dataset show that the proposed algorithm has a 14.72% higher accuracy on average than the K-Means algorithm and has a 0.88% lower accuracy than the normalized-cut spectral clustering.However,the proposed algorithm saves 16.8 times computation time compared to the normalized-cut spectral clustering.In conclusion,with the increase of data size,the computation time of the normalized-cut spectral clustering might become unacceptable;however,the proposed algorithm might efficiently give the near-optimal solutions.
出处 《计算机应用》 CSCD 北大核心 2012年第12期3274-3277,共4页 journal of Computer Applications
基金 山西省青年科技研究基金资助项目(2011021014-3)
关键词 谱聚类 文本聚类 大数据集 spectral clustering text clustering large dataset
  • 相关文献

参考文献20

  • 1MACQUEEN J B. Some methods of classification and analysis of multivariate observations [ C]// Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: U- niversity of California Press, 1967:281 -297.
  • 2ESTER M, KRIEGEL H, SANDER J, et al. A density-based algo- rithm for discovering clusters in large spatial databases with noise [ C]// Proceedings of the 2nd Intemational Conference on Knowl- edge Discovery and Data Mining. Oregon: AAAI Press, 1996:226 -231.
  • 3WANG W, YANG J, MUNTZ R R. STING: A statistical informa- tion grid approach to spatial data mining[ C]/! Proceedings of the International Conference on Very Large Data Bases. Athens: AAAI/ MIT Press, 1997:186 - 195.
  • 4DEMPSTER A P, LAIRD N M, RUBIN D B. Maximum likelihood from incomplete data via the EM algorithm[ J]. Journal of the Royal Statistical Society: Series B, 1977,39(1): 1-38.
  • 5SHI J B, MALIK J. Normalized cuts and image segmentation[ J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(8): 888-905.
  • 6ALZATE C, SUYKENS J A K. out-of-sample extensions through Transactions on Pattern Analysis 32(2): 335 -347. Multiway spectral clustering with weighted kernel PCA[ J]. IEEE and Machine Intelligence, 2010,.
  • 7MAIER M, LUXBURG U, HEIN M. Influence of graph construction on graph-based clustering measures[ C]//Advances in Neural Infor- mation Processing Systems. Cambridge, MA: MIT Press, 2009:1025 - 1032.
  • 8LUXBURG U. A tutorial on spectral clustering[ J]. Statistics and Computing, 2007, 17(4) : 395 -416.
  • 9CHEN W, SONG Y, BAI H, et al. Parallel spectral clustering indistributed systems[ J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33 (3) : 568 - 586.
  • 10RANGAPURAM S S, HEIN M. Constrained 1-spectral clustering [J]. Journal of Machine Learning Research, 2012, 22:1143 - 1151.

同被引文献22

引证文献3

二级引证文献8

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部