
E-mail categorization and filtering technology based on essential emerging pattern

Cited by: 3
Abstract: Spam is an increasingly serious problem and has attracted wide attention from researchers. Content-based classification and filtering is one of the mainstream techniques for addressing it. This paper studies e-mail content in depth and proposes a new feature extraction method that is better suited to spam classification. The new method is combined with CeEP, a classification algorithm based on essential emerging patterns (eEPs), and applied to spam detection, yielding an eEP-based e-mail categorization and filtering algorithm (the e-mail categorization and filtering technology based on eEP, ECFEP). Experiments show that combining the new feature extraction method with the CeEP classification algorithm is a highly effective classification approach, and that ECFEP's classification performance is higher than that of several of the better current classification algorithms.

Extended abstract: The volume of junk e-mail on the Internet has grown tremendously in the past few years. Spam now outnumbers normal e-mail, which causes serious problems. Content-based filtering is one of the mainstream technologies in use. At present, e-mail feature extraction mainly reuses feature extraction methods from text classification. Our analysis shows, however, that e-mail content has its own characteristics, and that relying only on text classification feature extraction causes problems and reduces classification performance.

Categorization methods based on emerging patterns (EPs) view samples as sets of items rather than as points in an n-dimensional space. EPs are itemsets whose supports change significantly from one data class to another. They capture the inherent distinctions between classes of data and represent knowledge that discriminates between them, so they are useful for building accurate classifiers. The essential emerging pattern (eEP) is a special kind of EP. eEPs retain the properties that make EPs useful for constructing accurate classifiers, but there are far fewer of them, so they are more efficient to mine and to use. EP-based categorization methods perform comparably to C4.5 and naive Bayes.

EP-based categorization methods have been applied successfully in many fields, such as DNA analysis, but we are not aware of any reports of applying them to e-mail categorization and filtering. This paper preprocesses the text of e-mails and, based on a study of e-mail content and its particular characteristics, proposes a new spam feature extraction method that is better suited to e-mail classification. It applies classification by essential emerging patterns, a recent classification method from the data mining community, to spam detection, and implements a new e-mail categorization and filtering algorithm, ECFEP (the e-mail categorization and filtering technology based on eEP).

Three experiments were run on the corpus: (1) evaluating ECFEP's results for different parameter values; (2) with the other parameters fixed, observing how the evaluation measures change as the growth rate changes; and (3) comparing ECFEP with the naive Bayes (NBayes), k-nearest neighbor (KNN), decision tree, and Bayesian neural network algorithms. All three show that combining the new feature extraction method with eEP-based classification is a very efficient classification method, and that the classification performance of ECFEP is higher than that of several current classification algorithms.
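The notion of an emerging pattern described in the abstract (an itemset whose support changes significantly between classes) can be illustrated with a minimal Python sketch. This is not the paper's CeEP or ECFEP algorithm, and it does not implement the additional selection that distinguishes eEPs from ordinary EPs. The function names, the brute-force enumeration, the `min_growth` and `max_len` parameters, and the toy word sets are all assumptions made for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def growth_rate(itemset, background, target):
    """Support in the target class divided by support in the background class."""
    s_bg = support(itemset, background)
    s_tg = support(itemset, target)
    if s_bg == 0.0:
        # Present only in the target class: infinite growth rate.
        return float("inf") if s_tg > 0.0 else 0.0
    return s_tg / s_bg


def emerging_patterns(background, target, min_growth, max_len=2):
    """Brute-force search for itemsets whose growth rate is >= min_growth.

    Only feasible for tiny vocabularies; real miners prune the search space.
    """
    items = sorted(set().union(*target))
    found = {}
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            rate = growth_rate(itemset, background, target)
            if rate >= min_growth:
                found[itemset] = rate
    return found


# Hypothetical toy data: each message is reduced to a set of words.
ham = [
    frozenset({"meeting", "report"}),
    frozenset({"meeting", "free"}),
    frozenset({"report", "lunch"}),
    frozenset({"lunch", "free"}),
]
spam = [
    frozenset({"free", "winner"}),
    frozenset({"free", "click"}),
    frozenset({"winner", "click"}),
    frozenset({"free", "winner", "click"}),
]

eps = emerging_patterns(ham, spam, min_growth=2.0)
for pattern, rate in eps.items():
    print(sorted(pattern), rate)
```

In this toy data the word "free" appears in both classes (support 0.75 in spam, 0.5 in ham), so its growth rate of 1.5 falls below the threshold of 2.0, while "winner" and "click" never occur in ham and therefore have an infinite growth rate. A classifier in the spirit of the abstract would score a new message by the patterns of each class that it contains; that scoring step is outside this sketch.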
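Authors: Li Yan (李艳), Fan Ming (范明)
Source: Journal of Nanjing University (Natural Science), 2008, No. 5, pp. 544-550 (7 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (60773048)
Keywords: e-mail categorization, feature extraction, essential emerging patterns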


