Indirect spectral clustering towards large text datasets (面向大文本数据集的间接谱聚类)

Cited by: 3

Abstract: To alleviate the computational bottleneck of spectral clustering, this paper proposes a fast ensemble algorithm called indirect spectral clustering. The algorithm first uses K-Means to over-cluster the dataset into many small groups (over-clusters), then treats each over-cluster as a basic object, and finally runs standard spectral clustering at the over-cluster level to produce the overall clustering. When the idea is applied to large text datasets, the similarity between over-cluster centers can be measured with the commonly used cosine measure. Experiments on the 20-Newsgroups text data show that indirect spectral clustering is on average 14.72% more accurate than K-Means and only 0.88% less accurate than normalized-cut spectral clustering, while needing on average less than 1/16 of the latter's computation time (reported as a 16.8-fold saving). As the dataset grows and normalized-cut spectral clustering hits its computational bottleneck, the proposed algorithm can still quickly return a near-optimal solution.
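The three steps in the abstract (over-cluster with K-Means, treat each over-cluster center as one object, run spectral clustering on those objects with cosine similarity) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the deterministic farthest-first seeding, the symmetric normalized Laplacian used for the spectral step, and the synthetic two-topic data are all assumptions made here to keep the example small and reproducible.

```python
import numpy as np

def farthest_first(X, k):
    """Deterministic seeding (an assumption): start at X[0], then add the farthest point."""
    idx = [0]
    d2 = ((X - X[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        nxt = int(d2.argmax())
        idx.append(nxt)
        d2 = np.minimum(d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[idx].copy()

def assign(X, centers):
    """Index of the nearest center (squared Euclidean distance) for every row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's K-Means; returns (centers, labels)."""
    X = np.asarray(X, dtype=float)
    centers = farthest_first(X, k)
    for _ in range(n_iter):
        labels = assign(X, centers)
        new = centers.copy()
        for j in range(k):
            if np.any(labels == j):      # keep the old center if a cluster is empty
                new[j] = X[labels == j].mean(axis=0)
        if np.allclose(new, centers):
            break
        centers = new
    return centers, assign(X, centers)

def cosine_affinity(C):
    """Pairwise cosine similarity between over-cluster centers, clipped to be non-negative."""
    unit = C / np.maximum(np.linalg.norm(C, axis=1, keepdims=True), 1e-12)
    return np.clip(unit @ unit.T, 0.0, None)

def spectral_partition(W, k):
    """Spectral clustering on affinity W via the symmetric normalized Laplacian."""
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)          # eigenvalues come back in ascending order
    U = vecs[:, :k]                      # eigenvectors of the k smallest eigenvalues
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return kmeans(U, k)[1]

def indirect_spectral_clustering(X, m, k):
    """Over-cluster X into m groups, cluster the m centers into k, map labels back."""
    centers, over = kmeans(X, m)                                      # step 1
    center_labels = spectral_partition(cosine_affinity(centers), k)   # steps 2 and 3
    return center_labels[over]

# Synthetic stand-in for term-weight vectors: two "topics" with disjoint vocabularies.
rng = np.random.default_rng(0)
topic_a = np.hstack([1 + rng.random((40, 5)), 0.01 * rng.random((40, 5))])
topic_b = np.hstack([0.01 * rng.random((40, 5)), 1 + rng.random((40, 5))])
X = np.vstack([topic_a, topic_b])
labels = indirect_spectral_clustering(X, m=6, k=2)
```

The eigendecomposition here runs on an m-by-m matrix rather than an n-by-n one, which is where the time saving described in the abstract would come from when m is much smaller than the number of documents n.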

Authors: HOU Haixia (侯海霞), YUAN Minmin (原民民), LIU Chunxia (刘春霞)

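Affiliations: Department of Computer Engineering, Taiyuan University; Department of Information Engineering, Shanxi Water Conservancy Vocational and Technical College; School of Computer Science and Technology, Taiyuan University of Science and Technology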

Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2012, No. 12, pp. 3274-3277 (4 pages)

Funding: Shanxi Province Youth Science and Technology Research Fund (2011021014-3)

Keywords: spectral clustering; text clustering; large dataset

Classification: TP301.6 [Automation and Computer Technology: Computer System Architecture]



