Abstract
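The spam problem is becoming increasingly serious and has attracted wide attention from researchers. Content-based classification and filtering is one of the mainstream techniques for addressing it. This paper studies e-mail content in depth and proposes a new feature extraction method that is better suited to spam classification. The new method is combined with CeEP, a classification algorithm based on essential emerging patterns (eEPs), and applied to spam detection, yielding an eEP-based e-mail classification and filtering algorithm (the e-mail categorization and filtering technology based on eEP, ECFEP). Experiments show that combining the new feature extraction method with the CeEP classifier is highly efficient, and that the classification efficiency of ECFEP is higher than that of several of the better current classification algorithms.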
The volume of junk e-mail on the Internet has grown tremendously in the past few years. Spam now outnumbers normal e-mail, which causes serious problems. Content-based filtering is one of the mainstream technologies in use so far. At present, e-mail feature extraction mainly borrows the feature extraction methods of text classification. However, our analysis shows that e-mail content has its own unique characteristics, so relying only on text classification feature extraction causes problems and reduces classification efficiency. Categorization methods based on emerging patterns (EPs) view samples as sets of items rather than as points in an n-dimensional space. EPs are itemsets whose supports change significantly from one data class to another. They serve as a good classification model because they capture the inherent distinctions between classes of data and represent the knowledge that discriminates between them, which makes them useful for building accurate classifiers. The essential emerging pattern (eEP) is a special kind of EP. eEPs retain all the properties that make EPs useful for constructing accurate classifiers, but they are far fewer in number, which makes them efficient to mine and to use.
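To make the notion of an emerging pattern concrete, the following is a minimal sketch of how the support of an itemset and its growth rate between two classes could be computed. It is an illustration only, not the paper's CeEP or ECFEP implementation: the function names, the toy `ham` and `spam` transactions, and the convention of treating growth rate as the ratio of the two supports (infinite when the background support is zero) are assumptions made for this example.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(itemset: FrozenSet[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def growth_rate(itemset: FrozenSet[str],
                background: List[Transaction],
                target: List[Transaction]) -> float:
    """Ratio of support in `target` to support in `background` (assumed convention)."""
    s_bg = support(itemset, background)
    s_tg = support(itemset, target)
    if s_bg == 0.0:
        # Absent from both classes: no change; present only in target: infinite.
        return 0.0 if s_tg == 0.0 else float("inf")
    return s_tg / s_bg


def is_emerging_pattern(itemset: FrozenSet[str],
                        background: List[Transaction],
                        target: List[Transaction],
                        min_growth: float) -> bool:
    """An itemset counts as an EP here if its growth rate reaches `min_growth`."""
    return growth_rate(itemset, background, target) >= min_growth


# Hypothetical toy data: each e-mail is reduced to a set of feature items.
ham = [
    frozenset({"meeting", "report"}),
    frozenset({"report", "free"}),
    frozenset({"meeting", "lunch"}),
    frozenset({"project", "report"}),
]
spam = [
    frozenset({"free", "winner"}),
    frozenset({"free", "winner", "click"}),
    frozenset({"free", "click"}),
    frozenset({"winner", "prize"}),
]

if __name__ == "__main__":
    for items in ({"free"}, {"free", "winner"}, {"report"}):
        pattern = frozenset(items)
        print(sorted(pattern), growth_rate(pattern, ham, spam))
```

In this toy data, an itemset such as `{"free", "winner"}` appears only in the spam class, so its support changes sharply between the two classes; this is the kind of contrast an EP-based classifier exploits.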
EP-based categorization methods perform comparably to C4.5 and Naive Bayes. They have been applied successfully in many fields, such as DNA analysis, but we are not aware of any reports of applying them to e-mail categorization and filtering.
This paper studies e-mail content and, in view of its uniqueness, preprocesses the e-mail text and proposes a new spam feature extraction method that is better suited to e-mail classification. It applies classification by essential emerging patterns, a recent classification method from the data mining community, to junk e-mail detection, and implements a new e-mail categorization and filtering algorithm based on eEPs, ECFEP (the e-mail categorization and filtering technology based on eEP). Three experiments were carried out on a spam corpus: evaluating the results of ECFEP under different parameter values; with the parameters fixed, observing how the evaluation metrics change as the growth rate varies; and comparing ECFEP with the naive Bayes (NBayes), k-nearest neighbor (KNN), decision tree, and Bayesian neural network algorithms. The experiments show that combining the new feature extraction method with eEP-based classification is a very efficient classification method, and that the classification efficiency of ECFEP is higher than that of several current classification algorithms.
Source
《南京大学学报(自然科学版)》
CAS
CSCD
PKU Core (Peking University Core Journals)
2008, No. 5, pp. 544-550 (7 pages)
Journal of Nanjing University(Natural Science)
Funding
National Natural Science Foundation of China (60773048)
Keywords
e-mail categorization, feature extraction, essential emerging patterns