Abstract
Existing subject indexing methods can generally only extract words that appear in the text; they cannot select, from a vocabulary of tens or hundreds of thousands of subject words, terms that are strongly related in meaning but absent from the text. Multi-label classification algorithms based on machine learning require training data for every label, which limits their use in subject indexing. To meet the need for indexing massive document collections against a large-scale subject vocabulary, this study proposes a hybrid automatic indexing method based on distributed word vectors. Word vectors trained on a large-scale corpus are used to generate representation vectors of the same dimension for subject words and for texts, so that the semantic similarity between a subject word and a text can be computed. A mapping table between subject words and common words is also built from a large-scale corpus, so that each text vector is compared only with a small number of strongly related subject word vectors, which greatly reduces the amount of computation and improves indexing efficiency. The resulting automatic indexing tool has been applied to nearly 100 million documents and achieved a relatively high indexing speed. In an experimental comparison with Jieba keywords, the subject words extracted by the proposed method overlap less with the author keywords, and, once non-subject words are removed from the Jieba keywords, the proposed method achieves higher indexing accuracy than Jieba. In a comparison with manual indexing, the effectiveness of the proposed method and the consistency of its results with the manual results increase steadily as the number of manually assigned indexing terms grows.
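The two ideas in the abstract (comparing same-dimension vectors for subject words and texts, and using a mapping table to restrict the candidates) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy 3-dimensional vectors, the hard-coded mapping table, the use of averaged word vectors as representation vectors, cosine similarity as the similarity measure, and the names `mean_vector`, `cosine` and `index_text`. The abstract does not specify how the authors build their vectors or their mapping table.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values, for illustration only).
WORD_VECTORS = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "training": [0.7, 0.3, 0.0],
    "protein": [0.0, 0.9, 0.2],
    "folding": [0.1, 0.8, 0.3],
}

# Subject terms, each described here by a few related words (an assumption).
SUBJECT_TERMS = {
    "deep learning": ["neural", "network", "training"],
    "structural biology": ["protein", "folding"],
}

# Hypothetical mapping table: common word -> strongly related subject terms.
WORD_TO_SUBJECTS = {
    "neural": ["deep learning"],
    "network": ["deep learning"],
    "training": ["deep learning"],
    "protein": ["structural biology"],
    "folding": ["structural biology"],
}


def mean_vector(words):
    """Average the vectors of the known words; return None if none are known."""
    vectors = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def index_text(tokens, top_k=3):
    """Rank only the subject terms that the mapping table links to the tokens."""
    text_vec = mean_vector(tokens)
    if text_vec is None:
        return []
    # Candidate filtering: look up each token instead of scanning all terms.
    candidates = {s for t in tokens for s in WORD_TO_SUBJECTS.get(t, [])}
    scored = []
    for subject in candidates:
        subject_vec = mean_vector(SUBJECT_TERMS[subject])
        if subject_vec is not None:
            scored.append((subject, cosine(text_vec, subject_vec)))
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


if __name__ == "__main__":
    print(index_text(["neural", "network", "model"]))
```

In this sketch the subject term "deep learning" can be assigned to a text even though the phrase itself never occurs in it, and the number of similarity computations depends on the size of the candidate set rather than on the size of the whole subject vocabulary.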
Authors
Han Hongqi (韩红旗)
Gui Jie (桂婕)
Zhang Yunliang (张运良)
Weng Mengjuan (翁梦娟)
Xue Shan (薛陕)
Yue Lindong (悦林东)
Han Hongqi; Gui Jie; Zhang Yunliang; Weng Mengjuan; Xue Shan; Yue Lindong (Institute of Scientific and Technical Information of China, Beijing 100038; Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, National Press and Publication Administration, Beijing 100038)
Source
《情报学报》
CSSCI
CSCD
Peking University Core Journals (北大核心)
2022, No. 5, pp. 475-485 (11 pages)
Journal of the China Society for Scientific and Technical Information
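Funding
Innovation Research Fund General Program of the Institute of Scientific and Technical Information of China, "Research on Interdisciplinary Collaboration Networks Based on the Subject Classification of Papers" (MS2022-04)
China Knowledge Centre for Engineering Sciences and Technology construction project, "Construction of Knowledge Organization Systems" (CKCEST-2022-1-29)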
Keywords
subject indexing
distributed word vector
multi-label text classification
keyword extraction
semantic label