期刊文献+

基于大规模语料库的新词检测 被引量:32

New Word Detection Based on Large-Scale Corpus
在线阅读 下载PDF
导出
摘要 自然语言的发展提出了快速跟踪新词的要求.提出了一种基于大规模语料库的新词检测方法,首先在大规模的Internet生语料上进行中文词法切分,然后在分词的基础上进行频度统计得到大量的候选新词.针对二元新词、三元新词、四元新词等的常见模式,用自学习的方法产生3个垃圾词典和一个词缀词典对候选新词进行垃圾过滤,最后使用词性过滤规则和独立词概率技术进一步过滤.据此实现了一个基于Internet的进行在线新词检测的系统,并取得了令人满意的性能.系统已经可以应用到新词检测、术语库建立、热点命名实体统计和词典编纂等领域. New word detection is a part of unknown word detection. The development of natural languages requires us to detect new words as soon as possible. In this paper, a new approach to detect new words based on large-scale corpus is presented. It first segments the corpus from the Internet with ICTCLAS, and searches for repeated strings, and then designs different filtering mechanisms to separate the true new words from the garbage strings, using rich features of various new word patterns. While getting rid of the garbage strings, three garbage lexicons and a suffix lexicon are used, which are learned by the system, and good results are achieved. Finally, the results of the experiments are discussed, which seem to be promising.
出处 《计算机研究与发展》 EI CSCD 北大核心 2006年第5期927-932,共6页 Journal of Computer Research and Development
基金 国家"八六三"高技术研究发展计划基金项目(2004AA114010 2003AA111010) 中国科学院计算技术研究所和富士通研究开发中心有限公司合作项目~~
关键词 新词 垃圾串 垃圾头 垃圾尾 独立词概率 new word garbage string garbage head garbage tail IWP
  • 相关文献

参考文献10

  • 1K.J.Chen,Ming-Hong Bai.Unknown word detection for Chinese by a corpus-based learning method.International Journal of Computational Linguistics and Chinese Language Processing,1998,3 (1):27~44
  • 2K.J.Chen,W.Y.Ma.Unknown word extraction for Chinese documents.The 19th COLING 2002,Taipei,2002
  • 3Jianfeng Gao,Mu Li,Andi Wu,et al.Chinese word segmentation:A pragmatic approach.Microsoft Research,Technical Report:MSR-TR-2004-123,2004
  • 4Nie Jian-Yun,Wanying Jin,Mareie-Louise Hannan.A hybrid approach to unknown word detection and segmentation of Chinese.Int' 1 Conf.Chinese Computing,Singapore,1994
  • 5Hua-Ping Zhang,Qun Liu,Hao Zhang,et al.Automatic recognition of Chinese unknown words based on roles tagging.The 1st SIGHAN Workshop on Chinese Language Processing,Taipei,2002
  • 6郑家恒,李文花.基于构词法的网络新词自动识别初探[J].山西大学学报(自然科学版),2002,25(2):115-119. 被引量:56
  • 7Andi Wu,Zixin Jiang.Statistically-enhanced new word identification in a rule-based Chinese system.The 2nd Chinese Language Processing Workshop,Hong Kong,2000
  • 8Fuchun Peng,Fangfang Feng,Andrew McCallum.Chinese segmentation and new word detection using conditional random fields.COLING 2004,Geneva,Switzerland,2004
  • 9Min-Jer Lee,Chien-Kang Huang,Lee-Feng Chien.Automatic construction of a bilingual live dictionary for spoken language processing applications.Oriental COCOSDA99,Taipei,1999
  • 10邹纲,刘洋,刘群,孟遥,于浩,西野文人,亢世勇.面向Internet的中文新词语检测[J].中文信息学报,2004,18(6):1-9. 被引量:60

二级参考文献4

  • 1郑家恒 李文花.新词语自动识别方法研究.自然语言理解与机器翻译[M].北京:清华大学出版社,2001..
  • 2陆志苇.现代汉语构词法(修订本)[M].北京:中华书局,1975..
  • 3Hua- Ping ZHANG, Qun LIU. et al, Chinese Name Entity Recognition Using Role Model[ J]. Special issue ''Word Formation and Chinese Language processing'' of the International Journal of Computational Linguistics and Chinese Language Processing, 2003, 8(2):2
  • 4Craig G. Nevill - Manning, Ian H. Witten. Identifying Hierarchical Structure in Sequences: A linear - time algorithm [J]. Journal of Artificial Intelligence Research, 1997, 7:67- 82

共引文献97

同被引文献274

引证文献32

二级引证文献179

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部