Abstract
Analyzing the co-occurrence frequency of tags in set-valued attributes makes it possible to mine frequent patterns and detect anomalies. To improve the performance of tag co-occurrence computation, this paper proposes a distributed tag co-occurrence algorithm based on k-set-cover for streaming data. The inclusion-exclusion principle for multiple sets is applied to analyze the tag co-occurrence problem, and a distributed computation workflow is proposed. An inverted index, borrowed from information retrieval, indexes the tags and the records in which they occur. Following the k-set-cover idea, the whole inverted index is partitioned across multiple distributed worker nodes, the local index of each node is updated dynamically as the streaming data changes, and the final result is obtained by aggregating the results from all nodes. Experiments show that, compared with other algorithms, the proposed algorithm has a lower average update time and uses fewer index replicas, and is therefore better suited to tag co-occurrence computation on large-scale streaming data.
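The ideas named in the abstract (an inverted index over tags, co-occurrence as posting-list intersection, inclusion-exclusion, and aggregation of per-node results) can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names and the toy data are hypothetical, and the records are simply split into disjoint shards rather than partitioned by k-set-cover, whose details the abstract does not give.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(records):
    """Map each tag to the set of record ids whose tag set contains it."""
    index = defaultdict(set)
    for record_id, tags in records.items():
        for tag in tags:
            index[tag].add(record_id)
    return index


def cooccurrence(index, tags):
    """Count records that contain every tag in `tags` (posting-list intersection)."""
    postings = [index.get(tag, set()) for tag in tags]
    if not postings:
        return 0
    return len(set.intersection(*postings))


def any_occurrence(index, tags):
    """Count records containing at least one tag, by inclusion-exclusion
    over the co-occurrence counts of all non-empty tag subsets."""
    tags = list(tags)
    total = 0
    for size in range(1, len(tags) + 1):
        sign = 1 if size % 2 == 1 else -1
        for subset in combinations(tags, size):
            total += sign * cooccurrence(index, subset)
    return total


def distributed_cooccurrence(shards, tags):
    """Assumption: shards hold disjoint records, so local counts simply add up."""
    return sum(cooccurrence(build_inverted_index(shard), tags) for shard in shards)


# Hypothetical toy data: record id -> set-valued tag attribute.
records = {1: {"a", "b"}, 2: {"a"}, 3: {"b", "c"}, 4: {"a", "b", "c"}}
index = build_inverted_index(records)
shards = [{1: records[1], 2: records[2]}, {3: records[3], 4: records[4]}]
```

Because the shards here are disjoint by record, summing the local counts gives the same answer as the global index. A partition that replicates index entries across nodes, as the abstract describes, would need an aggregation step that accounts for the replicas.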
Source
《计算机应用研究》
CSCD
Peking University Core Journals (北大核心)
2016, Issue 2, pp. 428-430, 434 (4 pages)
Application Research of Computers
Funding
Major Science and Technology Support Program of Hubei Province (2014BAA089)
Keywords
streaming data
distributed tag
co-occurrence algorithm
k-set-cover