Abstract
Automatic text classification is a core technology in data mining and information retrieval, and an active research topic. In practical applications, the volume of text data is often very large while the information useful to people makes up only a small part of it. Data in which one class has markedly fewer samples than the other classes is called an imbalanced data set, and its classes can be divided into a minority class and a majority class. Traditional methods achieve a relatively low recognition rate on the minority class, so effectively improving minority-class classification performance has become a problem that pattern recognition and machine learning must solve. To improve classification performance on minority-class text in imbalanced data sets, this paper resamples the data at the data level, using random sampling to improve the classifier's generalization performance on imbalanced data sets.
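As an illustration of data-level random resampling, the following is a minimal sketch rather than the paper's actual procedure. The abstract does not say whether the minority class is oversampled or the majority class is undersampled, so both modes are shown. The function name `random_resample` and its parameters are hypothetical.

```python
import random
from collections import Counter


def random_resample(samples, labels, mode="over", seed=0):
    """Balance class counts by random resampling (illustrative sketch).

    mode="over":  grow smaller classes by drawing with replacement
                  until each matches the largest class.
    mode="under": shrink larger classes by drawing without replacement
                  until each matches the smallest class.
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = random.Random(seed)

    # Group samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    sizes = [len(xs) for xs in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)

    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target:
            # Enough samples: pick `target` of them without replacement.
            chosen = rng.sample(xs, target)
        else:
            # Too few samples: keep all, then add random duplicates.
            extra = [rng.choice(xs) for _ in range(target - len(xs))]
            chosen = xs + extra
        out_x.extend(chosen)
        out_y.extend([y] * target)
    return out_x, out_y


if __name__ == "__main__":
    # Toy corpus: 8 majority-class documents, 2 minority-class documents.
    docs = [f"doc{i}" for i in range(10)]
    tags = ["majority"] * 8 + ["minority"] * 2
    _, balanced = random_resample(docs, tags, mode="over")
    print(Counter(balanced))
```

Oversampling keeps every original document but duplicates minority examples, which may encourage overfitting. Undersampling avoids duplicates but discards majority-class data. Which trade-off suits a given classifier is an empirical question.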
Source
《电脑编程技巧与维护》
2010, No. 6, pp. 21-22, 29 (3 pages)
Computer Programming Skills & Maintenance
Keywords
Automatic text categorization
Imbalanced data set
Minority class