Abstract
Overlapping ambiguity remains a major difficulty in Chinese word segmentation. Building a large-scale database of high-frequency maximal overlapping ambiguity strings (MOAS) is valuable both for understanding how they are distributed and for automatic disambiguation. This paper first shows experimentally that, compared with FBMM, only word omni-segmentation (full segmentation) detects the complete set of strictly defined MOAS, and that the number of MOAS detected is roughly proportional to the size of the dictionary. Then 14,906 high-frequency MOAS are collected from a People's Daily corpus of about 400 million characters, and 1,354,270 instances with their context are randomly sampled and manually labeled. Analysis of the data shows that about 70% of the MOAS with true ambiguity are strongly biased toward one segmentation, and a corresponding disambiguation strategy is put forward.
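The contrast between FBMM and omni-segmentation can be illustrated with a small sketch. It is not the paper's implementation: the toy lexicon, the function names (`fmm`, `bmm`, `find_moas`), and the `max_len` parameter are all assumptions made for illustration, FBMM is read here as comparing forward and backward maximum matching, and an MOAS is approximated as the span covered by a chain of crossing (overlapping but not nested) dictionary words.

```python
# Toy lexicon (hypothetical, for illustration only).
LEXICON = {"提高", "高人", "人民", "民生", "生活", "活水", "水平"}


def fmm(text, lexicon, max_len=4):
    """Forward maximum matching: longest word first, left to right."""
    out, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if j - i == 1 or text[i:j] in lexicon:  # fall back to one character
                out.append(text[i:j])
                i = j
                break
    return out


def bmm(text, lexicon, max_len=4):
    """Backward maximum matching: longest word first, right to left."""
    out, j = [], len(text)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if j - i == 1 or text[i:j] in lexicon:
                out.append(text[i:j])
                j = i
                break
    return out[::-1]


def find_moas(text, lexicon, max_len=4):
    """Enumerate every multi-character word span, then merge crossing chains."""
    spans = [(i, j)
             for i in range(len(text))
             for j in range(i + 2, min(len(text), i + max_len) + 1)
             if text[i:j] in lexicon]
    parent = list(range(len(spans)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, (s1, e1) in enumerate(spans):
        for b, (s2, e2) in enumerate(spans):
            if s1 < s2 < e1 < e2:  # the two words cross (overlap, not nested)
                parent[find(a)] = find(b)

    groups = {}
    for k, (s, e) in enumerate(spans):
        lo, hi, n = groups.get(find(k), (s, e, 0))
        groups[find(k)] = (min(lo, s), max(hi, e), n + 1)
    # Keep only chains made of at least two crossing words.
    return sorted(text[lo:hi] for lo, hi, n in groups.values() if n > 1)


sentence = "提高人民生活水平"
print(fmm(sentence, LEXICON) == bmm(sentence, LEXICON))  # → True
print(find_moas(sentence, LEXICON))  # → ['提高人民生活水平']
```

With this lexicon, forward and backward matching both return 提高/人民/生活/水平, so a detector that only looks for disagreement between the two reports nothing, while enumerating all dictionary words exposes the chain 提高, 高人, 人民, 民生, 生活, 活水, 水平 as a single ambiguous string.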
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core Journal (北大核心)
2006, No. 1, pp. 1-6 (6 pages)
Funding
Nanjing Normal University "211" Project grant (1240702504)
Keywords
computer application
Chinese information processing
maximal overlapping ambiguity string
word omni-segmentation
biased segmentation