
Research on Text Classification Techniques Based on Unbalanced Data Sets (cited by: 1)

Unbalanced Data Sets Based on the Text Classification Technology Research
Abstract: Automatic text classification is a core technology in data mining and information retrieval, and an active research topic. In practical applications the volume of text data is often very large, while the information that is useful to people makes up only a small part of it. Data in which one class has markedly fewer samples than the other classes is called an unbalanced data set. The classes in such a data set can be divided into minority classes and majority classes. Traditional methods achieve a relatively low recognition rate on minority classes, so effectively improving minority-class classification performance has become a problem that pattern recognition and machine learning must solve. To improve classification performance on minority-class text in unbalanced data sets, this paper resamples the data at the data level, using random sampling to improve the classifier's generalization performance on unbalanced data sets.
Author: 白凤凤 (Bai Fengfeng)
Source: Computer Programming Skills & Maintenance (《电脑编程技巧与维护》), 2010, No. 6, pp. 21-22, 29 (3 pages)
Keywords: automatic text categorization; unbalanced data set; minority class
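
The abstract says only that the data are resampled at the data level by random sampling. It does not say whether the minority class is oversampled or the majority class is undersampled. The following is a minimal sketch of both variants, not the paper's implementation. The function name `random_resample`, its `mode` and `seed` parameters, and the toy documents are hypothetical and chosen purely for illustration.

```python
import random
from collections import Counter, defaultdict


def random_resample(texts, labels, mode="over", seed=0):
    """Balance class sizes by random resampling (illustrative sketch only).

    mode="over":  grow every class to the size of the largest class by
                  duplicating randomly chosen documents.
    mode="under": shrink every class to the size of the smallest class by
                  keeping a random subset of its documents.
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")

    rng = random.Random(seed)

    # Group the documents by class label.
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)

    sizes = [len(docs) for docs in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)

    out_texts, out_labels = [], []
    for label, docs in by_class.items():
        if len(docs) >= target:
            # Draw without replacement down to the target size.
            chosen = rng.sample(docs, target)
        else:
            # Keep every original and add random duplicates up to the target.
            chosen = docs + rng.choices(docs, k=target - len(docs))
        out_texts.extend(chosen)
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Toy data (assumed): 2 minority-class documents and 8 majority-class documents.
docs = [f"doc{i}" for i in range(10)]
tags = ["minority"] * 2 + ["majority"] * 8

balanced_docs, balanced_tags = random_resample(docs, tags, mode="over")
print(Counter(balanced_tags))
```

Oversampling keeps every original document but repeats minority-class documents, which can encourage overfitting to those few examples. Undersampling avoids duplicates at the cost of discarding majority-class documents. Whichever variant is used, resampling would normally be applied to the training split only, so that evaluation still reflects the original class distribution.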