Research on comparison of three topic segmentation approaches (cited by: 2)
Abstract: Text segmentation has important applications in many fields, including information retrieval, automatic summarization, language modeling, and anaphora resolution. Segmentation based on the PLSA and LDA models tries to link the different latent topics within a segment to the observable word-sentence pairs on the surface of the text, whereas segmentation based on the small-world model identifies segment boundaries from that model's short path lengths and high clustering. The three approaches are compared in terms of model characteristics, segmentation strategy, and experimental results. The analysis shows that LDA-based segmentation is more stable than PLSA-based segmentation and has a lower error rate. The small-world segmentation strategy is better suited to texts that exhibit pronounced small-world properties.
Authors: Shi Jing (石晶), Li Wanlong (李万龙)
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, PKU Core, 2009, No. 18, pp. 135-138, 151 (5 pages)
Funding: Doctoral Fund of Changchun University of Technology (No. 2008A02)
Keywords: text segmentation; Probabilistic Latent Semantic Analysis (PLSA) model; Latent Dirichlet Allocation (LDA) model; small world model
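
The two small-world properties named in the abstract, short path length and high clustering, can be illustrated with a minimal sketch. It assumes a word co-occurrence graph in which two words are linked when they appear in the same sentence; this construction, and all function names below, are illustrative assumptions, since the paper's own graph construction and boundary-detection strategy are not given on this page.

```python
from collections import deque
from itertools import combinations


def cooccurrence_graph(sentences):
    """Build an undirected graph: words are nodes, linked if they share a sentence.

    Assumption: whitespace tokenization and sentence-level co-occurrence.
    """
    graph = {}
    for sentence in sentences:
        words = set(sentence.split())
        for word in words:
            graph.setdefault(word, set())
        for a, b in combinations(words, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clustering_coefficient(graph):
    """Average local clustering coefficient over all nodes.

    Nodes with fewer than two neighbours contribute 0.
    """
    if not graph:
        return 0.0
    total = 0.0
    for neighbours in graph.values():
        k = len(neighbours)
        if k < 2:
            continue
        # Count edges that exist among the neighbours of this node.
        links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


def average_path_length(graph):
    """Mean shortest-path length over all connected node pairs, via BFS."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    g = cooccurrence_graph(["a b c", "c d"])
    print(clustering_coefficient(g), average_path_length(g))
```

A text whose graph combines a high clustering coefficient with a short average path length, relative to a random graph of the same size, is one with "obvious" small-world features in the sense of the abstract; how those values are then turned into segment boundaries is left open here.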

References (13)

  • 1. Bolshakov I A, Gelbukh A. Text segmentation into paragraphs based on local text cohesion[C]//Lecture Notes in Artificial Intelligence, N 2166, Text, Speech and Dialogue (TSD-2001). [S.l.]: Springer-Verlag, 2001: 158-166.
  • 2. Kehagias A, Nicolaou A, Fragkou P, et al. Text segmentation by product partition models and dynamic programming[J]. Mathematical and Computer Modelling, 2004, 39: 209-217.
  • 3. Tur G, Hakkani-Tur D, Stolcke A, et al. Integrating prosodic and lexical cues for automatic topic segmentation[J]. Computational Linguistics, 2001, 27(1): 31-57.
  • 4. Levow G A. Prosody-based topic segmentation for Mandarin broadcast news[C]//Proceedings of HLT-NAACL 2004, 2004, 2.
  • 5. Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation[J]. Journal of Machine Learning Research, 2003, 3: 993-1022.
  • 6. Griffiths T, Steyvers M. Finding scientific topics[J]. Proceedings of the National Academy of Sciences, 2004, 101: 5228-5235.
  • 7. Watts D, Strogatz S. Collective dynamics of small-world networks[J]. Nature, 1998, 393: 440-442.
  • 8. Hofmann T. Unsupervised learning by probabilistic latent semantic analysis[J]. Machine Learning Journal, 2001, 42(1): 177-196.
  • 9. Yutaka Matsuo. Clustering using small world structure[C]//Proc 6th Int'l Conf on Knowledge-based Intelligent Information Engineering Systems & Applied Technologies (KES2002), Crema, Italy, September 2002. [S.l.]: IOS Press/Ohmsha, 2002: 1252-1256.
  • 10. Beeferman D, Berger A, Lafferty J. Statistical models for text segmentation[J]. Machine Learning, 1999, 34: 177-210.

Secondary references (38)

  • 1. Suo Hongguang, Liu Yushu, Cao Shuying. A keyword extraction method based on lexical chains[J]. Journal of Chinese Information Processing (中文信息学报), 2006, 20(6): 25-30. Cited by: 88
  • 2. Shi Jing, Dai Guozhong. Text segmentation based on the PLSA model[J]. Journal of Computer Research and Development (计算机研究与发展), 2007, 44(2): 242-248. Cited by: 25
  • 3. Igor A Bolshakov, A Gelbukh. Text segmentation into paragraphs based on local text cohesion[G]. In: Text, Speech and Dialogue (TSD-2001), Lecture Notes in Artificial Intelligence 2166. Berlin: Springer-Verlag, 2001. 158-166
  • 4. Ath Kehagias, A Nicolaou, P Fragkou, et al. Text segmentation by product partition models and dynamic programming[J]. Mathematical and Computer Modelling, 2004, 39(2-3): 209-217
  • 5. G Tur, D Hakkani-Tur, A Stolcke, et al. Integrating prosodic and lexical cues for automatic topic segmentation[J]. Computational Linguistics, 2001, 27(1): 31-57
  • 6. Gina-Anne Levow. Prosody-based topic segmentation for Mandarin broadcast news[C]. HLT-NAACL 2004, Boston, Massachusetts, USA, 2004
  • 7. D Blei, P Moreno. Topic segmentation with an aspect hidden Markov model[C]. In: Proc of the 24th Annual Int'l ACM SIGIR Conf on Research and Development in Information Retrieval. New York: ACM Press, 2001. 343-348
  • 8. Thorsten Brants, Francine Chen, Ioannis Tsochantaridis. Topic-based document segmentation with probabilistic latent semantic analysis[C]. The 11th Int'l Conf on Information and Knowledge Management, McLean, Virginia, USA, 2002
  • 9. F Y Y Choi, P Wiemer-Hastings, J Moore. Latent semantic analysis for text segmentation[C]. The 2001 Conf on Empirical Methods in Natural Language Processing, Pittsburgh, PA, USA, 2001
  • 10. Thomas Hofmann. Probabilistic latent semantic analysis[C]. In: Proc of the 15th Annual Conf on Uncertainty in Artificial Intelligence (UAI-99). San Francisco, CA: Morgan Kaufmann, 1999. 289-296

Co-citing literature (32)

Co-cited literature (41)

  • 1. Fu Jianlian, Chen Qunxiu. Research on topic partitioning in automatic summarization systems[J]. Journal of Chinese Information Processing (中文信息学报), 2005, 19(6): 28-35. Cited by: 13
  • 2. Su Jinshu, Zhang Bofeng, Xu Xin. Advances in machine learning based text categorization[J]. Journal of Software (软件学报), 2006, 17(9): 1848-1859. Cited by: 391
  • 3. Labadié A, Prince V. Finding text boundaries and finding topic boundaries: two different tasks // Nordström B, Ranta A. GoTAL 2008. Gothenburg, 2008, 5221: 260-271.
  • 4. Brown G, Yule G. Discourse analysis, Cambridge textbooks in linguistics series. Britain: Cambridge University Press, 1983.
  • 5. Reynar J. Statistical models for topic segmentation // Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics. New Jersey, 1999: 357-364.
  • 6. Kern R, Granitzer M. Efficient linear text segmentation based on information retrieval techniques // MEDES. Lyon, 2009: 167-171.
  • 7. Hearst M. TextTiling: segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 1997, 23(1): 33-64.
  • 8. Beeferman D, Berger A, Lafferty J. Statistical models for text segmentation. Machine Learning, 1999, 34: 177-210.
  • 9. Halliday M A K, Hasan R. Cohesion in English. London: Longman, 1976.
  • 10. Hearst M A, Plaunt C. Subtopic structuring for full-length document access // Proceedings of the 16th Annual International ACM/SIGIR. Pittsburgh, 1993: 59-68.

Citing literature (2)

Secondary citing literature (13)
