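Abstract
Complex data are now widely used, and using them effectively requires managing their quality. Entity resolution is a basic operation in data quality management: it discovers different descriptions of the same entity within a data set, and can be used for error detection, inconsistency discovery, and similar tasks. Because complex data carry rich structural information, entity resolution on them differs from entity resolution on traditional text and relational data, and it raises new technical challenges. This paper introduces the concepts and applications of entity resolution on complex data, discusses the principles of entity resolution techniques for XML data, graph data, and complex networks, and finally outlines future research directions.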
It is increasingly common to find data with a complex structure in the real world. To use complex data effectively in practice, techniques must be in place to improve the quality of the data. Entity resolution is a central issue in data quality management for complex objects. Its goal is to find the data objects that refer to the same real-world entity and to cluster such objects together. It has proven extremely useful in data fusion, inconsistency detection, and data repairing. Nevertheless, the complex structure of the data introduces new challenges and makes object identification much harder than record matching on relational data. In response to these challenges, there has been a lot of work on this topic. This paper aims to provide an overview of recent advances in the study of object identification on complex objects, including XML, graph data, and complex networks. For XML data, we survey techniques for pairwise and group-wise entity resolution. For graph data, we focus on how to determine whether two graphs refer to the same real-world entity. We also present the metrics and methods for identifying vertices that pertain to the same real-world entity in a complex network. Finally, we discuss directions for future research.
Source
Chinese Journal of Computers (《计算机学报》)
EI
CSCD
PKU Core Journal
2011, No. 10, pp. 1843-1852 (10 pages)
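Funding
Supported by:
National Natural Science Foundation of China (61003046, 61033015, 61133002)
RSE-NSFC Exchange Project (61111130189)
National Basic Research Program of China ("973" Program) (2012CB316200)
Doctoral Program Fund of the Ministry of Education of China (20102302120054)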
Keywords
data quality
complex data
object identification
XML
graph
complex network