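Abstract
Stream data is now an important form of data, and applying OLAM (online analytical mining) to it gives analysts multi-level views of the data. OLAM, however, requires aggregating data at different granularities, while the temporal nature and continuous arrival of stream data mean that the data cannot be processed repeatedly. Traditional OLAP (online analytical processing) methods cannot generate partially materialized views, and because stream data is massive, limited storage space makes it impossible to keep the information of every data cell. To address these problems, a sketch-based OLAM framework for stream data, sketch cube, is proposed. The framework maps any combination of dimensions to a unique natural number, prunes dimension combinations according to the monotonicity of upper and lower bounds, stores the information of valid data cells in near-linear space, and builds a time-series index to improve retrieval efficiency. Theoretical analysis gives the preconditions for using sketch cube, and experiments on real, massive stream data show that sketch cube meets the needs of real-time mining in terms of effectiveness, storage space efficiency, and accuracy.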
Stream data has recently become one of the most significant data formats. OLAM (online analytical mining) operations can provide multi-level data views for analysts. However, OLAM operations depend on data aggregation, which conflicts with the continuous growth and dynamic nature of stream data. As a result, typical OLAP approaches cannot create partial materialized views directly from stream data, and storage limits make it impossible to save all data cells. To solve these problems, a sketch-based OLAM framework named sketch cube was proposed for analyzing stream data. Sketch cube maps a set of attributes to a unique number, stores it in a sub-linear data structure, and then builds an inverted index over tiled time windows. The preconditions for using sketch cube were derived by theoretical analysis, and storage efficiency and query performance were evaluated on a massive mobile data corpus; the results indicate that sketch cube supports the requirements of real-time analysis.
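The two core ideas named in the abstract, mapping a set of attributes to a unique number and storing cell aggregates in a compact sketch, can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual encoding or sketch structure, so everything here is an assumption made for illustration: the bitmask encoding in `combo_to_id`, the use of a Count-Min sketch, and the names `cell_key`, `CountMinSketch`, `width`, and `depth` are all hypothetical. Pruning and the time-window index are omitted.

```python
import hashlib
from typing import Iterable, Sequence


def combo_to_id(dims: Iterable[int]) -> int:
    """Map a set of dimension indices to a unique natural number.

    Assumption: a bitmask encoding (bit d is set when dimension d is in
    the combination). The paper's own mapping may differ.
    """
    code = 0
    for d in dims:
        code |= 1 << d
    return code


def cell_key(dims: Sequence[int], values: Sequence[str]) -> str:
    """Build a string key for one data cell: combination id plus values."""
    return f"{combo_to_id(dims)}:" + "|".join(values)


class CountMinSketch:
    """Count-Min sketch: approximate counters in fixed space.

    Estimates never fall below the true count; hash collisions can
    only inflate them.
    """

    def __init__(self, width: int = 2048, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        # One salted hash per row gives `depth` different hash functions.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key: str) -> int:
        return min(
            self.table[row][self._index(row, key)]
            for row in range(self.depth)
        )


# Usage: count stream records for the cell (dimension 0 = "Hangzhou",
# dimension 2 = "4G"); the dimension names and values are made up.
sketch = CountMinSketch()
key = cell_key([0, 2], ["Hangzhou", "4G"])
for _ in range(3):
    sketch.add(key)
```

The memory used by the table is fixed by `width` and `depth` regardless of how many distinct cells arrive, which is the trade the abstract describes: bounded storage in exchange for approximate aggregates.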
Source
《电信科学》
Peking University Core Journal (北大核心)
2014, No. 9, pp. 61-71 (11 pages)
Telecommunications Science
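Funding
Zhejiang Provincial Natural Science Foundation of China (No. LQ14F020002)
Academic Climbing Fund for Young and Middle-Aged Discipline Leaders of Zhejiang Undergraduate Universities (No. PD2013453)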
Keywords
stream data, sketch cube, online analytical mining, real-time analysis