Abstract
Text segmentation has important applications in many fields, including information retrieval, automatic summarization, language modeling, and anaphora resolution. Segmentation based on the PLSA and LDA models attempts to associate the latent topics within a segment with the observable word-sentence pairs on the text surface, whereas segmentation based on the small-world model identifies segment boundaries by exploiting that model's short path lengths and high clustering. The three approaches are compared in terms of model characteristics, segmentation strategy, and experimental results. The analysis shows that LDA-based segmentation is more stable than PLSA-based segmentation and achieves a lower error rate, while the small-world segmentation strategy is better suited to texts that exhibit pronounced small-world properties.
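The two small-world properties named in the abstract, short path length and high clustering, can be made concrete with a minimal Python sketch. This is an illustration only, not the segmentation method of the paper: the function names are hypothetical, and the rule that links two words whenever they share a sentence is an assumption made for the example.

```python
from collections import deque
from itertools import combinations


def build_cooccurrence_graph(sentences):
    """Link two words whenever they share a sentence (assumed linking rule)."""
    graph = {}
    for sentence in sentences:
        words = set(sentence.lower().split())
        for word in words:
            graph.setdefault(word, set())
        for a, b in combinations(words, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def average_clustering(graph):
    """Mean local clustering coefficient; nodes with < 2 neighbours count as 0."""
    total = 0.0
    for neighbours in graph.values():
        k = len(neighbours)
        if k < 2:
            continue
        # Count edges that actually exist among this node's neighbours.
        links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph) if graph else 0.0


def average_path_length(graph):
    """Mean shortest-path length over reachable ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0
```

A graph is usually described as small-world when its clustering coefficient is high relative to a random graph of the same size while its average path length stays comparably short; computing both values for a text's word graph gives a rough indication of how pronounced these properties are.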
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journal (北大核心)
2009, Issue 18, pp. 135-138, 151 (5 pages)
Funding
Doctoral Fund of Changchun University of Technology (No. 2008A02)
Keywords
text segmentation
Probabilistic Latent Semantic Analysis (PLSA) model
Latent Dirichlet Allocation (LDA) model
small-world model