Multi-source text topic mining model based on Dirichlet multinomial allocation model

Cited by: 1
Abstract: As the sources of text data multiply, topic mining over multi-source text data has become a research focus in text mining. Traditional topic models are designed mainly for single-source text, so applying them directly to multi-source data is subject to many limitations. To address this problem, a multi-source topic mining model based on the Dirichlet Multinomial Allocation (DMA) model is proposed, namely the Multi-Source Dirichlet Multinomial Allocation model (MSDMA). The model accounts for the differences in a topic's word distribution across data sources and draws on the nonparametric clustering property of DMA. It addresses three problems: 1) it learns the source-specific word distribution of the same topic in each data source; 2) because the data sources share a topic space and a vocabulary space, they can complement one another's topic knowledge, which improves topic discovery on sources with high noise and low information content; 3) it learns the number of topics within each data source automatically, so the number of topics does not have to be specified in advance. Experimental results on a simulated dataset and two real datasets indicate that the proposed model mines topic information from multi-source data more effectively than traditional topic models.
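The generative idea behind the abstract can be illustrated with a minimal sketch. It is not the authors' MSDMA specification: the abstract gives no equations, so the one-topic-per-document mixture, the function names (`sample_dirichlet`, `generate_corpus`, `topics_used`), and all parameter values are assumptions made for illustration. Each source draws its own topic weights from a symmetric Dirichlet(alpha / K) prior over K candidate topics (the finite DMA prior) and its own word distribution for every topic index. How MSDMA ties together the source-specific word distributions of one topic is not stated in the abstract and is omitted here.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from Dirichlet(alphas) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    if total == 0.0:  # guard against underflow when every alpha is tiny
        return [1.0 / len(draws)] * len(draws)
    return [d / total for d in draws]


def generate_corpus(n_sources=2, n_docs=50, doc_len=20, vocab_size=30,
                    max_topics=20, alpha=1.0, beta=0.1, seed=0):
    """Generate toy documents; every value here is an illustrative assumption."""
    rng = random.Random(seed)
    vocab = list(range(vocab_size))
    topics = list(range(max_topics))  # topic indices shared by all sources
    corpus = []
    for _ in range(n_sources):
        # Finite DMA prior: symmetric Dirichlet(alpha / K) over K candidates.
        # Most of the mass lands on a few topics, so many stay unused.
        weights = sample_dirichlet([alpha / max_topics] * max_topics, rng)
        # Source-specific word distribution for each topic index.
        word_dists = [sample_dirichlet([beta] * vocab_size, rng)
                      for _ in topics]
        docs = []
        for _ in range(n_docs):
            z = rng.choices(topics, weights=weights, k=1)[0]  # one topic per doc
            words = rng.choices(vocab, weights=word_dists[z], k=doc_len)
            docs.append((z, words))
        corpus.append(docs)
    return corpus


def topics_used(corpus):
    """Count how many distinct topics each source actually uses."""
    return [len({z for z, _ in docs}) for docs in corpus]


if __name__ == "__main__":
    print(topics_used(generate_corpus()))
```

Calling `topics_used(generate_corpus())` reports how many of the K candidate topics each source occupies. With a small `alpha`, this count is typically well below `max_topics`, which is the intuition behind letting the model determine the number of topics per source rather than fixing it beforehand.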
Authors: XU Liyang; HUANG Ruizhang; CHEN Yanping; QIAN Zhisen; LI Wanying (College of Computer Science and Technology, Guizhou University, Guiyang Guizhou 550025, China; Guizhou Provincial Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China; State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing Jiangsu 210093, China)
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list; 2018, No. 11, pp. 3094-3099, 3104 (7 pages)
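Funding: National Natural Science Foundation of China (61462011); Major Research Plan of the National Natural Science Foundation of China (91746116); Major Applied Basic Research Project of Guizhou Province (黔科合JZ字[2014]2001); Science and Technology Major Special Project of Guizhou Province (黔科合重大专项字[2017]3002); Natural Science Foundation of Guizhou Province (黔科合基础[2018]1035)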
Keywords: multi-source text data; topic model; blocked-Gibbs sampling; Dirichlet Multinomial Allocation (DMA); text mining
