A Document Clustering Algorithm Based on Dirichlet Process Mixture Model

Cited by: 10

Abstract: With the spread of the Internet, new media such as forums, microblogs, and WeChat have become important channels for people to obtain and publish information. Because both the number and the content of these online texts are uncertain, they pose a major challenge for clustering analysis of Internet public opinion. In document clustering, choosing an appropriate number of clusters has long been a difficult problem. This paper proposes a document clustering algorithm based on the Dirichlet process mixture model (DCA-DPMM). Built on a nonparametric Bayesian framework, the algorithm extends the standard finite mixture model to a mixture with an infinite number of components. It implements the Dirichlet process mixture model using the Chinese restaurant process (CRP) construction of the Dirichlet process, and then solves the model approximately with Gibbs sampling, so that the number of clusters is determined over the course of the iterations. Experimental results show that, compared with the classical K-means algorithm, the proposed algorithm not only determines the number of topic clusters dynamically but also achieves clearly better clustering quality in terms of purity, F-score, and silhouette coefficient.
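The abstract gives no implementation details, so the following is only a minimal sketch of how a CRP-based Dirichlet process mixture can be fitted with collapsed Gibbs sampling. It is not the authors' DCA-DPMM code. Several things here are assumptions made for illustration: documents are bag-of-words dictionaries mapping word id to count, each cluster uses a Dirichlet-multinomial likelihood with a symmetric prior `beta`, and the names `crp_gibbs`, `log_predictive`, `alpha`, and `beta` are hypothetical. The point it shows is that a document joins an existing cluster with weight proportional to the cluster size times the predictive likelihood, or opens a new cluster with weight proportional to `alpha`, so the number of clusters is never fixed in advance.

```python
import math
import random


def log_predictive(doc, counts, total, beta, vocab_size):
    """Log probability of the words in `doc` given the word counts already in
    a cluster, with the word distribution integrated out under a symmetric
    Dirichlet(beta) prior (assumed Dirichlet-multinomial likelihood)."""
    n_doc = sum(doc.values())
    logp = math.lgamma(total + vocab_size * beta)
    logp -= math.lgamma(total + n_doc + vocab_size * beta)
    for word, c in doc.items():
        prev = counts.get(word, 0)
        logp += math.lgamma(prev + c + beta) - math.lgamma(prev + beta)
    return logp


def _update(cluster, doc, sign):
    """Add (sign=+1) or remove (sign=-1) a document from a cluster."""
    cluster["size"] += sign
    for word, c in doc.items():
        cluster["counts"][word] = cluster["counts"].get(word, 0) + sign * c
        cluster["total"] += sign * c


def crp_gibbs(docs, vocab_size, alpha=1.0, beta=0.1, iters=50, seed=0):
    """Collapsed Gibbs sampling over cluster assignments under a CRP prior.
    Returns one cluster label per document (labels are arbitrary integers)."""
    rng = random.Random(seed)
    # Start with every document seated at a single table (cluster 0).
    clusters = {0: {"size": 0, "counts": {}, "total": 0}}
    assign = [0] * len(docs)
    for doc in docs:
        _update(clusters[0], doc, +1)
    next_id = 1
    for _ in range(iters):
        for i, doc in enumerate(docs):
            # Remove document i from its cluster; drop the cluster if empty.
            old = assign[i]
            _update(clusters[old], doc, -1)
            if clusters[old]["size"] == 0:
                del clusters[old]
            # Existing cluster k: weight = size_k * predictive likelihood.
            labels = list(clusters)
            log_w = [
                math.log(clusters[k]["size"])
                + log_predictive(doc, clusters[k]["counts"],
                                 clusters[k]["total"], beta, vocab_size)
                for k in labels
            ]
            # New cluster: weight = alpha * likelihood under the prior alone.
            labels.append(next_id)
            log_w.append(math.log(alpha)
                         + log_predictive(doc, {}, 0, beta, vocab_size))
            # Normalise in log space for numerical stability, then sample.
            top = max(log_w)
            weights = [math.exp(x - top) for x in log_w]
            new = rng.choices(labels, weights=weights)[0]
            if new == next_id:
                clusters[new] = {"size": 0, "counts": {}, "total": 0}
                next_id += 1
            _update(clusters[new], doc, +1)
            assign[i] = new
    return assign


if __name__ == "__main__":
    # Toy corpus (made up): two groups of documents with disjoint vocabularies.
    docs = [{0: 5, 1: 5, 2: 5}] * 4 + [{3: 5, 4: 5, 5: 5}] * 4
    labels = crp_gibbs(docs, vocab_size=6)
    print(labels, "clusters:", len(set(labels)))
```

In this sketch a larger `alpha` makes new clusters more likely, while a smaller `beta` favours clusters with sharply peaked word distributions. Both values are illustrative defaults, not settings reported by the paper.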

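Authors: GAO Yue (高悦), WANG Wenxian (王文贤), YANG Shuxian (杨淑贤)

Affiliations: Institute of Network and Trusted Computing, College of Computer Science, Sichuan University; Cybersecurity Research Institute, Sichuan University; Supreme People's Procuratorate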

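Source: Netinfo Security (《信息网络安全》), 2015, No. 11, pp. 60-65 (6 pages)

Funding: National Key Technology R&D Program of China [2012BAH18B05]; National Natural Science Foundation of China [61272447]

Keywords: document clustering; Dirichlet process mixture model; Bayesian nonparametrics; Gibbs sampling

Classification: TP391.1 [Automation and Computer Technology, Computer Application Technology]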
