
K-set-cover based distributed tag co-occurrence algorithm in streaming data

Cited by: 1
Abstract: Analyzing the co-occurrence frequency of tags in set-valued attributes makes it possible to mine frequent patterns and detect anomalies. To improve the performance of tag co-occurrence computation, this paper proposes a distributed tag co-occurrence algorithm based on k-set-cover for streaming data. It applies the inclusion-exclusion principle for multiple sets to analyze the tag co-occurrence problem and proposes a distributed computation workflow. An inverted index, borrowed from information retrieval, indexes each tag together with the records in which it appears. Following the k-set-cover idea, the full inverted index is partitioned across multiple distributed worker nodes, each worker's local index is updated dynamically as the stream changes, and the final result is obtained by aggregating the results from all workers. Experiments show that, compared with other algorithms, the proposed algorithm has a lower average update time and uses fewer index replicas, making it better suited to tag co-occurrence computation over large-scale streaming data.
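Authors: Zhu Ming (朱明), Li Yuexin (李跃新)
Source: Application Research of Computers (《计算机应用研究》), indexed in CSCD and the Peking University Core Journals list, 2016, No. 2, pp. 428-430, 434 (4 pages)
Funding: Major Science and Technology Support Project of Hubei Province (2014BAA089)
Keywords: streaming data; distributed tag; co-occurrence algorithm; k-set-cover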
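
The abstract names three building blocks: an inverted index from tags to the records that carry them, inclusion-exclusion over tag sets, and aggregation of per-node results. The following is a minimal Python sketch of those ideas only, not the paper's algorithm. The names `TagIndex`, `route`, and `distributed_cooccurrence` are hypothetical, and the partitioning by record ID modulo the number of workers is a simple stand-in for the k-set-cover partitioning, whose details the abstract does not give.

```python
from collections import defaultdict
from itertools import combinations


class TagIndex:
    """Inverted index from a tag to the IDs of the records that carry it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, record_id, tags):
        # Stream update: register one newly arrived record under each tag.
        for tag in tags:
            self.postings[tag].add(record_id)

    def cooccurrence(self, tags):
        # Number of records that carry every one of the given tags.
        sets = [self.postings.get(t, set()) for t in tags]
        if not sets:
            return 0
        return len(set.intersection(*sets))

    def any_of(self, tags):
        # Number of records carrying at least one tag, computed by
        # inclusion-exclusion over the co-occurrence counts.
        tags = list(tags)
        total = 0
        for size in range(1, len(tags) + 1):
            sign = 1 if size % 2 == 1 else -1
            for subset in combinations(tags, size):
                total += sign * self.cooccurrence(subset)
        return total


def route(workers, record_id, tags):
    # ASSUMPTION: modulo partitioning by integer record ID, used here only
    # as a placeholder for the k-set-cover based partitioning.
    workers[record_id % len(workers)].add(record_id, tags)


def distributed_cooccurrence(workers, tags):
    # Each record lives on exactly one worker, so local counts can be summed.
    return sum(w.cooccurrence(tags) for w in workers)


# Illustrative records (made up for this sketch).
RECORDS = {1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"b", "c"}, 4: {"a"}}

single = TagIndex()
workers = [TagIndex(), TagIndex()]
for rid, tags in RECORDS.items():
    single.add(rid, tags)
    route(workers, rid, tags)

print(single.cooccurrence(["a", "b"]))
print(single.any_of(["a", "b", "c"]))
print(distributed_cooccurrence(workers, ["a", "b"]))
```

Partitioning by record keeps the per-worker counts disjoint, which is why a plain sum suffices in this sketch. A partitioning by tag, as an inverted-index split would suggest, needs replicated postings so that co-occurring tags can be intersected on one node, which is presumably where the index-replica cost mentioned in the abstract comes from.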
