Abstract
Medical texts contain a large number of specialized terms and are therefore more prone than general-domain texts to word segmentation errors and out-of-vocabulary (OOV) words. These errors cause a loss of contextual semantics and reduce the accuracy of named entity recognition (NER). To address these problems, this paper proposes WI-NER, a Chinese medical NER model based on gated recurrent units that incorporates lexical information. First, based on the characteristics of Chinese medical datasets, the paper describes the task definition, entity position labels, and entity category labels for NER in the Chinese medical domain; in the embedding layer, the model performs feature embedding and vector fusion for characters that match specialized words. Second, a lexical gating unit is added to the context encoding layer, which uses the memory and forgetting mechanisms of recurrent neural networks to automatically extract the features needed for entity recognition; introducing lexical information and prior knowledge improves Chinese medical NER performance. Finally, the model is evaluated on three datasets, and the results show that the proposed model outperforms the baseline models in accuracy and achieves the expected medical-domain characteristics.
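The embedding-layer matching and the gating idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how WI-NER matches characters to lexicon words or how its lexical gate is parameterized, so everything here is an assumption for illustration: the B/M/E/S (begin/middle/end/single) grouping, the function names `match_lexicon` and `gated_fusion`, the `max_len` parameter, and the element-wise sigmoid gate are all hypothetical and are not taken from the paper.

```python
import math


def match_lexicon(sentence, lexicon, max_len=4):
    """For each character, collect the lexicon words that cover it,
    grouped by the character's position in the word (B/M/E/S).
    Assumed scheme for illustration, not the paper's confirmed method."""
    matches = [{"B": [], "M": [], "E": [], "S": []} for _ in sentence]
    n = len(sentence)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = sentence[start:end]
            if word not in lexicon:
                continue
            if len(word) == 1:
                matches[start]["S"].append(word)
                continue
            matches[start]["B"].append(word)      # first character of the word
            matches[end - 1]["E"].append(word)    # last character of the word
            for mid in range(start + 1, end - 1):
                matches[mid]["M"].append(word)    # interior characters
    return matches


def gated_fusion(char_vec, word_vec, weights, bias):
    """Hypothetical element-wise gate: g = sigmoid(w_c * c + w_w * w + b),
    output = g * word + (1 - g) * char."""
    fused = []
    for c, w, (w_c, w_w), b in zip(char_vec, word_vec, weights, bias):
        g = 1.0 / (1.0 + math.exp(-(w_c * c + w_w * w + b)))
        fused.append(g * w + (1.0 - g) * c)
    return fused


# Toy lexicon and sentence, chosen only for this example.
lexicon = {"患者", "糖尿", "糖尿病", "病史"}
matches = match_lexicon("患者有糖尿病史", lexicon)
```

In this toy example the character 尿 is the end of 糖尿 and the middle of 糖尿病, which is the kind of overlapping word-boundary evidence that a lexical gate could weigh against the plain character representation.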
Authors
陈晶
孙亚轩
邢珂萱
CHEN Jing; SUN Yaxuan; XING Kexuan (School of Electronics and Information Engineering, Guangdong Ocean University, Zhanjiang 524088; School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004; Key Laboratory of Virtual Technology and System Integration, Yanshan University, Qinhuangdao 066004)
Source
《高技术通讯》
CAS
Peking University Core Journals (北大核心)
2024, Issue 10, pp. 1058-1069 (12 pages)
Chinese High Technology Letters
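Funding
National Natural Science Foundation of China (62172352, 61871465, 42306218)
Central Government Guided Local Science and Technology Development Fund (226Z0102G, 226Z0305G)
Natural Science Foundation of Hebei Province (2022203028)
Guangdong Ocean University Scientific Research Start-up Fund (060302102304)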
Keywords
Chinese medical named entity recognition
prior knowledge
embedding layer
gated unit
lexical information