K-set-cover based distributed tag co-occurrence algorithm in streaming data

Original title: 流数据环境下基于k集合覆盖的分布式标签共现算法. Cited by: 1


Abstract: Analyzing the co-occurrence frequency of tags in set-valued attributes makes it possible to mine frequent patterns and detect anomalies. To improve the performance of tag co-occurrence computation, this paper proposes a distributed tag co-occurrence algorithm for streaming data based on k-set cover. The tag co-occurrence problem is analyzed with the inclusion-exclusion principle for multiple sets, and a distributed computation workflow is proposed. An inverted index, borrowed from information retrieval, indexes each tag together with the records in which it appears. Following the k-set-cover idea, the full inverted index is partitioned across multiple distributed worker nodes, each worker's local index is updated dynamically as the stream changes, and the final result is obtained by aggregating the results of all workers. Experiments show that, compared with other algorithms, the proposed algorithm has a lower average update time and uses fewer index replicas, which makes it better suited to tag co-occurrence computation over large-scale streaming data.
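The pipeline named in the abstract (inverted index, inclusion-exclusion, per-worker local indexes, aggregation) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `LocalIndex`, `route`, `global_cooccurrence`, and `global_union` are hypothetical, and `route` partitions by record id as a simple stand-in. It does not reproduce the paper's k-set-cover partitioning, whose details are not given in this record.

```python
from collections import defaultdict
from itertools import combinations


class LocalIndex:
    """Inverted index held by one worker: tag -> set of record ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, record_id, tags):
        # A new record arrives on the stream.
        for tag in tags:
            self.postings[tag].add(record_id)

    def remove(self, record_id, tags):
        # A record expires from the stream window.
        for tag in tags:
            self.postings[tag].discard(record_id)

    def cooccurrence(self, tags):
        """Number of local records that carry every tag in `tags`."""
        sets = [self.postings.get(tag, set()) for tag in tags]
        if not sets:
            return 0
        return len(set.intersection(*sets))


def union_count(index, tags):
    """Local records carrying at least one tag, by inclusion-exclusion."""
    total = 0
    for size in range(1, len(tags) + 1):
        sign = 1 if size % 2 == 1 else -1
        for subset in combinations(tags, size):
            total += sign * index.cooccurrence(subset)
    return total


def route(record_id, num_workers):
    # Placeholder partitioning by record id (an assumption of this sketch),
    # NOT the k-set-cover partitioning proposed in the paper.
    return record_id % num_workers


def global_cooccurrence(workers, tags):
    # Aggregation step: sum the local counts from every worker.
    return sum(w.cooccurrence(tags) for w in workers)


def global_union(workers, tags):
    # Summing is valid here only because each record lives on one worker.
    return sum(union_count(w, tags) for w in workers)


workers = [LocalIndex() for _ in range(2)]
stream = [(0, {"a", "b"}), (1, {"a"}), (2, {"a", "b", "c"}), (3, {"b", "c"})]
for record_id, tags in stream:
    workers[route(record_id, len(workers))].add(record_id, tags)

before_ab = global_cooccurrence(workers, ["a", "b"])
union_ab = global_union(workers, ["a", "b"])

# Record 0 leaves the window; only its worker's local index is touched.
workers[route(0, len(workers))].remove(0, {"a", "b"})
after_ab = global_cooccurrence(workers, ["a", "b"])
print(before_ab, union_ab, after_ab)
```

In this sketch each record is stored on exactly one worker, so no index entries are replicated and local counts can simply be added. A tag-based partition such as the one the abstract describes would instead have to decide which tags' postings are co-located, which is where index replication and its cost come in.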

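Authors: Zhu Ming (朱明), Li Yuexin (李跃新)

Affiliation: School of Computer Science and Information Engineering, Hubei University

Source: Application Research of Computers (《计算机应用研究》), CSCD, PKU Core, 2016, No. 2, pp. 428-430, 434 (4 pages)

Funding: Major Science and Technology Support Project of Hubei Province (2014BAA089)

Keywords: streaming data; distributed; tag co-occurrence algorithm; k-set cover

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]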



