
A Document Clustering Algorithm Based on Dirichlet Process Mixture Model

Cited by: 10
Abstract With the spread of the Internet, new media such as forums, microblogs, and WeChat have become important channels through which people obtain and publish information. Because the number and content of these online texts are uncertain, they pose a major challenge for clustering analysis of Internet public opinion, and choosing an appropriate number of clusters has long been a difficulty in document clustering. This paper proposes a document clustering algorithm based on the Dirichlet process mixture model (DCA-DPMM). Built on the nonparametric Bayesian framework, the algorithm extends the standard finite mixture model to a mixture model with an infinite number of components. It implements the Dirichlet process mixture model using the Chinese restaurant process (CRP) construction of the Dirichlet process, and then applies Gibbs sampling for approximate inference, so that the number of clusters is determined over the course of the iterations. Experimental results show that, compared with the classical K-means algorithm, the proposed algorithm not only determines the number of topic clusters dynamically but also achieves clearly better clustering quality as measured by purity, F-score, and silhouette coefficient.
Source Netinfo Security (《信息网络安全》), 2015, No. 11, pp. 60-65 (6 pages)
Funding National Key Technology R&D Program of China [2012BAH18B05]; National Natural Science Foundation of China [61272447]
Keywords document clustering; Dirichlet process mixture model; Bayesian nonparametrics; Gibbs sampling
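
The abstract describes a mixture model built on the Chinese restaurant process and fitted by Gibbs sampling, with purity as one of the quality measures. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes a bag-of-words Dirichlet-multinomial likelihood with a symmetric prior `beta` and a CRP concentration `alpha`; the abstract specifies neither, and all function names, parameter values, and the toy corpus are hypothetical.

```python
import math
import random
from collections import Counter


def log_dm_predictive(doc, cluster_words, cluster_total, vocab_size, beta):
    """Log probability of a document's words given a cluster's word counts,
    with the cluster's word distribution integrated out under a symmetric
    Dirichlet(beta) prior (an assumed likelihood, not taken from the paper)."""
    logp = 0.0
    n_doc = 0
    for word, count in doc.items():
        for j in range(count):
            logp += math.log(cluster_words.get(word, 0) + beta + j)
        n_doc += count
    for i in range(n_doc):
        logp -= math.log(cluster_total + vocab_size * beta + i)
    return logp


def crp_gibbs(docs, vocab_size, alpha=1.0, beta=0.1, iters=50, seed=0):
    """Collapsed Gibbs sampling for a CRP mixture: each document is
    repeatedly reassigned to an existing cluster (weight n_k) or to a
    new one (weight alpha), so the cluster count is not fixed in advance."""
    rng = random.Random(seed)
    docs = [Counter(d) for d in docs]
    sizes = [sum(d.values()) for d in docs]
    z = [0] * len(docs)  # start with every document in one cluster
    clusters = {0: {"n": len(docs), "words": Counter(), "total": sum(sizes)}}
    for d in docs:
        clusters[0]["words"].update(d)
    next_id = 1
    for _ in range(iters):
        for i, d in enumerate(docs):
            # Remove document i from its current cluster.
            old = clusters[z[i]]
            old["n"] -= 1
            old["words"].subtract(d)
            old["total"] -= sizes[i]
            if old["n"] == 0:
                del clusters[z[i]]
            # Log weights for every existing cluster, then for a new one.
            ids = list(clusters)
            logw = [
                math.log(clusters[k]["n"])
                + log_dm_predictive(d, clusters[k]["words"],
                                    clusters[k]["total"], vocab_size, beta)
                for k in ids
            ]
            ids.append(next_id)
            logw.append(math.log(alpha)
                        + log_dm_predictive(d, Counter(), 0, vocab_size, beta))
            top = max(logw)
            weights = [math.exp(x - top) for x in logw]
            k = rng.choices(ids, weights=weights)[0]
            if k == next_id:
                clusters[k] = {"n": 0, "words": Counter(), "total": 0}
                next_id += 1
            clusters[k]["n"] += 1
            clusters[k]["words"].update(d)
            clusters[k]["total"] += sizes[i]
            z[i] = k
    return z


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority true class
    of the cluster they were assigned to."""
    by_cluster = {}
    for t, p in zip(labels_true, labels_pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(labels_true)


# Hypothetical toy corpus: two pet documents and two finance documents.
toy_docs = [["cat", "dog", "pet"], ["dog", "pet", "cat"],
            ["stock", "market", "trade"], ["market", "trade", "stock"]]
toy_labels = [0, 0, 1, 1]
assignments = crp_gibbs(toy_docs, vocab_size=6)
print(len(set(assignments)), purity(toy_labels, assignments))
```

In this sketch a larger `alpha` makes new clusters more likely, while `beta` controls how strongly a cluster's existing word counts influence assignments. Both would need tuning on real data.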


