An Automatic Document Keyphrase Extraction Method Based on the Word Co-occurrence Graph (cited by: 30)

A Kind of Automatic Text Keyphrase Extraction Method Based on Word Co-occurrence
Abstract (translated from the Chinese): Keyphrase extraction is a foundational task in automatic text processing. Building on an in-depth study of existing keyphrase extraction methods, this paper proposes an automatic document keyphrase extraction method based on the word co-occurrence graph. Starting from word-frequency statistics, the method uses the topic information formed in the word co-occurrence graph, together with the linking features between different topics, to extract the keyphrases of a document automatically. It aims to find words that are not high-frequency yet contribute strongly to the topic. Experiments show that the keyphrases extracted by this method match the author's topic more accurately.

Abstract (English): Advances in high-volume storage media have led to an explosion in the amount of machine-readable text. Keyphrase extraction is one of the fundamental tasks of natural language processing. Building on a study of existing keyphrase extraction methods, this paper puts forward a novel automatic text keyphrase extraction method based on word co-occurrence. The method starts from word-frequency statistics and uses the topic information captured in the word co-occurrence graph together with the linkage information between different topics. The goal is to extract keyphrases whose content most accurately matches the specific and unique interests of the user. The algorithm extracts keyphrases that represent the asserted main point of a document without relying on external resources such as natural language processing tools or a document corpus. It segments a graph that represents the co-occurrence between terms in a document into clusters. Each cluster corresponds to a concept on which the author's idea is based, and the terms ranked highest by a statistic based on each term's relationship to these clusters are selected as keyphrases. The experimental results show that the extracted terms match the author's point quite accurately, even though the method does not use the average frequency of each term in a corpus; that is, the method is a content-sensitive, domain-independent indexing device. Its purpose is to find words that are not frequent but contribute greatly to the subject of the text. The greatest benefit is the extraction of infrequent words that carry the substance of the document, i.e., the concepts or ideas presented by the author. This merit can help satisfy search engine users with unique interests.
Source: Journal of Nanjing University (Natural Science), 2006, No. 2, pp. 156-162 (7 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
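Funding: National Natural Science Foundation of China (70171052, 90104030); Natural Science Foundation of the Anhui Provincial Department of Education (2005kj009zd)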
Keywords: natural language processing; word co-occurrence graph; keyphrase; term frequency-inverse document frequency (TFIDF)
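
The abstract outlines a pipeline: build a graph of term co-occurrence within a document, segment it into clusters, then rank terms by their relationship to those clusters. The following is a minimal Python sketch of that idea, not the paper's actual algorithm. Several parts are assumptions made purely for illustration: co-occurrence is counted per sentence, connected components stand in for the unspecified graph segmentation, and the score (number of clusters a term co-occurs with, ties broken by frequency) stands in for the unspecified ranking statistic. The function names and the parameters `n_frequent`, `min_pair_count`, and `top_k` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count how often each unordered word pair shares a sentence."""
    pair_counts = Counter()
    for words in sentences:
        # Sorting makes every pair key (a, b) satisfy a < b.
        for a, b in combinations(sorted(set(words)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def clusters_of(nodes, edges):
    """Connected components (an assumed stand-in for graph segmentation)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, clusters = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def extract_keyphrases(sentences, n_frequent=4, min_pair_count=2, top_k=3):
    """Rank words by how many clusters they co-occur with (assumed score)."""
    freq = Counter(w for words in sentences for w in set(words))
    # Graph nodes: the most frequent words, ties broken alphabetically.
    frequent = set(sorted(freq, key=lambda w: (-freq[w], w))[:n_frequent])
    pairs = cooccurrence_counts(sentences)
    edges = [
        pair for pair, count in pairs.items()
        if pair[0] in frequent and pair[1] in frequent
        and count >= min_pair_count
    ]
    clusters = clusters_of(frequent, edges)

    scores = {}
    for word in freq:
        linked = sum(
            1 for cluster in clusters
            if any(pairs[tuple(sorted((word, other)))] > 0
                   for other in cluster - {word})
        )
        scores[word] = (linked, freq[word])
    ranked = sorted(scores, key=lambda w: (-scores[w][0], -scores[w][1], w))
    return clusters, ranked[:top_k]


# Toy document: two topics, plus one less frequent word that links both.
sentences = [
    ["graph", "cluster"],
    ["graph", "cluster"],
    ["graph", "cluster", "keyphrase"],
    ["search", "engine"],
    ["search", "engine"],
    ["search", "engine", "keyphrase"],
]
clusters, top = extract_keyphrases(sentences)
print(top)  # ['keyphrase', 'cluster', 'engine']
```

In this toy input, "keyphrase" occurs less often than any other word but is the only one that co-occurs with both clusters, so it ranks first. This mirrors the stated aim of surfacing words that are not high-frequency yet contribute strongly to the topic. A pure frequency ranking would place it last.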