Abstract
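Keyword extraction is widely used in natural language processing. For Chinese keyword extraction, the word segmentation results and the choice of candidate words strongly affect the final extraction results. For candidate selection, this paper proposes a method that recognizes unknown words made up of consecutive single characters, together with multi-word phrases. It identifies unknown words with a frequency greater than 1 well and does not depend on corpus size or domain. In addition, building on traditional TF-IDF and combining position and length features, the paper takes into account the different parts of speech of multi-category words and proposes an improved TF-IDF formula for keyword extraction. Comparative experiments demonstrate the influence of candidate words on keyword extraction. Compared with TF-IDF, the improved TF-IDF raises precision by about 5%.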
Keyword extraction is widely used in natural language processing. For Chinese keyword extraction, the selection of candidate words affects the final extraction result. This paper proposes a method to recognize unknown words that consist of consecutive single Chinese characters, as well as multi-word phrases. The method can better identify unknown words whose frequency is greater than one, without depending on the scale or domain of the corpus. Taking into account words with multiple parts of speech, together with word position and length, keyword and key phrase extraction is carried out with a new method that adds these features to traditional TF-IDF. Comparative experiments show the effect of candidate words on keyword extraction. Compared with traditional TF-IDF, the P, R and F values of the improved TF-IDF method improve by about 5%.
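The abstract does not give the recognition procedure itself, so the following is only a minimal sketch of one way the "consecutive single characters with frequency greater than one" idea could be realized. It assumes the text has already been segmented into a token list, and the function name `single_char_candidates` and the parameters `min_freq` and `max_len` are hypothetical, not taken from the paper.

```python
from collections import Counter

def single_char_candidates(tokens, min_freq=2, max_len=4):
    """Collect candidate unknown words from runs of single-character tokens.

    Assumption (not from the paper): every contiguous sub-sequence of
    2..max_len single-character tokens is counted, and those seen at
    least min_freq times are kept as candidates.
    """
    counts = Counter()
    run = []
    # The empty-string sentinel flushes a run that reaches the end of the input.
    for tok in list(tokens) + [""]:
        if len(tok) == 1 and tok.isalpha():
            run.append(tok)
            continue
        # The run has ended: count its sub-sequences of length 2..max_len.
        for i in range(len(run)):
            for j in range(i + 2, min(i + max_len, len(run)) + 1):
                counts["".join(run[i:j])] += 1
        run = []
    return {word: n for word, n in counts.items() if n >= min_freq}

# Hypothetical segmenter output in which the word 微博 was split into 微 and 博.
tokens = ["我们", "在", "微", "博", "上", "讨论", "微", "博", "的", "功能"]
candidates = single_char_candidates(tokens)
```

Counting sub-sequences rather than whole runs keeps neighbouring function words (在, 上, 的) from hiding the repeated string; a real system would likely add further filtering, which is outside what the abstract describes.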
Source
《小型微型计算机系统》
CSCD
Peking University Core Journal
2016, No. 4, pp. 711-715 (5 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National Natural Science Foundation of China (61173100, 61173101, 61272375)