摘要
频繁项集的挖掘是数据挖掘的核心内容。本文提出挖掘最大频繁项集的并行算法P-MinMax,它采用数据库的垂直表示和基于前缀关系的等价类划分,利用因子项集的完全包含关系在处理机之间贪心分配等价类,根据等价类的需要相应地划分和有选择地复制数据库记录,使各处理机得以异步计算,达到了较好的负载平衡。分析和实验表明,P-MinMax有较好的可扩展性,其性能优于已有同类算法。
Mining frequent itemsets is a crucial issue in data mining applications. The complexity of the problem has been shown as NP-hard. Parallel techniques are widely used to improve the efficiency of mining algorithms. A novel and powerful parallel algorithm for mining maximal frequent itemsets, called P-MinMax, is proposed in this paper, which is based on its serial version MinMax. The new algorithm decomposes the search space by prefix-based equivalence classes, distributes work among the processors by complete inclusive relation between equivalence class gene itemsets and selectively duplicates databases in such a way that each processor can compute the frequent itemsets independently. These techniques eliminate the need for synchronization, drastically cutting down the I/O overhead. The analysis and experimental results demonstrate the superb efficiency of the approach in comparison with the previous work.
出处
《计算机科学》
CSCD
北大核心
2004年第12期132-134,188,共4页
Computer Science
基金
国家自然科学基金(60273075)