Research on comparison of three topic segmentation approaches (cited by: 2)


Abstract: Text segmentation has important applications in many fields, including information retrieval, automatic summarization, language modeling, and anaphora resolution. Segmentation based on the PLSA and LDA models tries to link the latent topics hidden within a segment to the observable word-sentence pairs on the surface of the text, whereas segmentation based on the small-world model identifies segment boundaries by exploiting the short path lengths and high clustering that characterize small-world networks. The three approaches are compared in terms of model characteristics, segmentation strategy, and experimental results. The analysis shows that LDA-based segmentation is more stable than PLSA-based segmentation and has a lower error rate. Segmentation based on the small-world model is better suited to texts that exhibit pronounced small-world properties.
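The two small-world properties named in the abstract, short path length and high clustering, can be measured on any undirected graph. The sketch below is only an illustration and is not the authors' segmentation procedure: it assumes a text has already been turned into a graph whose nodes are words and whose edges link co-occurring words (the abstract does not say how the graph is built), and the names `toy_graph`, `clustering_coefficient`, and `average_path_length` are hypothetical.

```python
from collections import deque
from itertools import combinations


def clustering_coefficient(graph):
    """Average local clustering coefficient of an undirected graph."""
    total = 0.0
    for node, neighbours in graph.items():
        k = len(neighbours)
        if k < 2:
            continue  # a node with fewer than two neighbours contributes 0
        # Count edges that exist among the neighbours of this node.
        links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


def average_path_length(graph):
    """Mean shortest-path length over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for source in graph:
        # Breadth-first search gives shortest paths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy graph: two tightly knit word clusters joined by one edge.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}

print(clustering_coefficient(toy_graph))
print(average_path_length(toy_graph))
```

A graph with a high clustering coefficient and a short average path length, relative to a random graph of the same size, is the kind of text graph the abstract describes as having pronounced small-world properties.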

Authors: Shi Jing, Li Wanlong

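Affiliations: School of Computer Science and Engineering, Changchun University of Technology; College of Computer Science and Technology, Jilin University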

Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal; 2009, No. 18, pp. 135-138, 151 (5 pages in total)

Funding: Doctoral Fund of Changchun University of Technology (No. 2008A02)

Keywords: text segmentation; Probabilistic Latent Semantic Analysis (PLSA) model; Latent Dirichlet Allocation (LDA) model; small world model

Classification: TP301 [Automation and Computer Technology: Computer System Architecture]



