Text Segmentation Based on the LDA Model (cited by: 54)
Abstract: Text segmentation has important applications in many fields, including information retrieval, automatic summarization, language modeling, and anaphora resolution. LDA-based text segmentation models both the corpus and the individual texts with LDA, performs inference with Gibbs sampling (an MCMC method) to estimate the model parameters indirectly, and obtains the probability distribution over words, so that the different latent topics within a segment are linked to the observable surface words of the text. The experiments take whole Chinese sentences as the elementary blocks and try several similarity metrics and boundary-estimation strategies. The best results show that a suitable combination of the two yields a segment-boundary error rate far lower than that of other text segmentation algorithms of the same kind.
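The pipeline described in the abstract (LDA inference by Gibbs sampling, a similarity measure between adjacent blocks, then a boundary decision) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the hyperparameters `alpha` and `beta`, the choice of cosine similarity, and the fixed-threshold boundary rule are all assumptions made for illustration, and the record above does not say which metrics or strategies performed best.

```python
import math
import random


def gibbs_lda(blocks, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA over blocks (lists of word ids).

    Returns theta: one smoothed topic distribution per block.
    All names and default values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    block_topic = [[0] * num_topics for _ in blocks]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Give every word token a random initial topic.
    for b, block in enumerate(blocks):
        zs = []
        for w in block:
            z = rng.randrange(num_topics)
            zs.append(z)
            block_topic[b][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iters):
        for b, block in enumerate(blocks):
            for i, w in enumerate(block):
                # Remove the token's current assignment from the counts.
                z = assignments[b][i]
                block_topic[b][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional p(z = k | all other assignments).
                weights = [
                    (block_topic[b][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[b][i] = z
                block_topic[b][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return [
        [(count + alpha) / (len(block) + num_topics * alpha) for count in block_topic[b]]
        for b, block in enumerate(blocks)
    ]


def cosine(p, q):
    """Cosine similarity between two topic distributions."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return dot / norm


def find_boundaries(theta, threshold=0.5):
    """Place a boundary before block i+1 when adjacent blocks are dissimilar.

    The fixed threshold is an assumed, deliberately simple boundary strategy.
    """
    return [
        i + 1
        for i in range(len(theta) - 1)
        if cosine(theta[i], theta[i + 1]) < threshold
    ]
```

As a usage example, calling `find_boundaries([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])` returns `[2]`, because only the second and third blocks fall below the similarity threshold. In practice each block would be one segmented sentence mapped to word ids, and `theta` would come from `gibbs_lda`.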
Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2008, No. 10, pp. 1865-1873 (9 pages)
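Funding: National Key Basic Research Program of China ("973" Program) (2002CB312103); National Natural Science Foundation of China (60503054); Major Project of the Innovation Program of the Institute of Software, Chinese Academy of Sciences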
Keywords: text segmentation; Latent Dirichlet Allocation (LDA) model; similarity metric; boundary discovery

