Multi-source Heterogeneous Data Governance Based on Semi-supervised Learning
Abstract: To enable interoperability between different data management systems, we propose a multi-source heterogeneous data governance framework based on semi-supervised learning, and on that basis design, implement, and test an automated method for aligning unstructured data with structured data. Named entity recognition (NER) is first used to convert unstructured data into structured data, and a string-similarity-based method and a supervised-learning-based method are then each used for schema matching on the structured data. A semi-supervised learning method matches and merges the structured data with the corresponding entities recorded in the database. Finally, natural language processing (NLP) techniques and deep learning methods are used to impute missing values in the merged dataset. The results show that, after alignment, the F1-scores on the paper dataset and the video metadata dataset reach 89.70% and 96.50%, respectively; after missing value imputation on different attributes, the accuracy is above 78%, a substantial improvement over the baseline methods.
Authors: RAO Weixiong; GAO Hongye; LIN Cheng; ZHAO Qinpei; YE Feng (School of Software Engineering, Tongji University, Shanghai 201804, China; National Key Laboratory for Complex Systems Simulation, Beijing 100101, China)
Source: Journal of Tongji University: Natural Science (indexed in EI, CAS, CSCD, PKU Core), 2022, No. 10, pp. 1392-1404 (13 pages)
Funding: Shanghai Municipal Science and Technology Major Project (2021SHZDZX0100); Fundamental Research Funds for the Central Universities.
Keywords: semi-supervised learning; data governance; multi-source heterogeneous data; missing data imputation; named entity recognition (NER)
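
The abstract mentions a string-similarity-based method for schema matching but does not say which similarity measure is used. As a rough illustration only, the sketch below uses the ratio from Python's standard `difflib.SequenceMatcher`; the function names, the attribute names, and the 0.6 threshold are all hypothetical choices, not details taken from the paper.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] between two attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_schema(source_attrs, target_attrs, threshold=0.6):
    """Map each source attribute to its most similar target attribute.

    Attributes whose best score falls below `threshold` (an assumed cutoff)
    are left unmatched and mapped to None.
    """
    mapping = {}
    for src in source_attrs:
        best = max(target_attrs, key=lambda tgt: similarity(src, tgt))
        mapping[src] = best if similarity(src, best) >= threshold else None
    return mapping


if __name__ == "__main__":
    # Hypothetical attribute names: fields extracted by NER vs. database columns.
    extracted = ["Title", "author_names", "pub_year", "doi"]
    database = ["title", "authors", "year", "venue"]
    print(match_schema(extracted, database))
```

A name-only matcher like this cannot pair attributes whose names differ but whose values agree, which is one plausible reason to complement it with a learned matcher, as the abstract describes.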