Abstract
The C4.5/C5.0 decision tree classification algorithms and their improved variants show relatively high training and test error rates when building a decision tree. To address this, three improvement strategies are proposed. First, attribute correlation is used to measure and reduce attributes: it quantifies the correlation between an attribute and the concept, which removes irrelevant attributes and also identifies redundant attributes that are highly correlated with one another. Second, the decision tree is pruned using an appropriate confidence value, which reduces the number of attributes and the number of distinct values per attribute while keeping the tree feasible and effective. Third, an optimized variant of the Chi2 algorithm discretizes continuous attributes more reasonably and accurately. A classifier is designed and implemented on the basis of these strategies and applied to the Breast-cancer data set. The experimental results show that the decision tree generated by the improved algorithm achieves higher classification accuracy than the classical decision tree algorithms.
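As a rough illustration of the discretization step, the sketch below computes the standard chi-square statistic for two adjacent intervals, the quantity that ChiMerge and Chi2 style discretizers use to decide whether to merge them. It is not the optimized Chi2 variant of the paper, whose details the abstract does not give. The function name `chi2_adjacent` and the class-count input layout are assumptions made for this example, and cells with zero expected frequency are simply skipped, which is a simplification.

```python
from typing import Sequence


def chi2_adjacent(a: Sequence[int], b: Sequence[int]) -> float:
    """Chi-square statistic for two adjacent intervals.

    a[j] and b[j] are the numbers of samples of class j that fall into
    the first and second interval (hypothetical input layout).
    """
    if len(a) != len(b):
        raise ValueError("both intervals need counts for the same classes")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    stat = 0.0
    for row in (a, b):
        row_sum = sum(row)
        for j, observed in enumerate(row):
            col_sum = a[j] + b[j]
            expected = row_sum * col_sum / total
            # Simplification: skip cells whose expected count is zero.
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Two intervals with opposite class distributions: a large statistic,
# so the boundary between them would be kept.
print(chi2_adjacent([10, 0], [0, 10]))
# Two intervals with identical class distributions: statistic 0,
# so they are candidates for merging.
print(chi2_adjacent([5, 5], [5, 5]))
```

A low statistic means the class distribution does not depend on which of the two intervals a sample falls into, so the cut point between them carries little information and the intervals can be merged. For two classes (one degree of freedom) the usual 0.05 significance threshold is about 3.84.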
Source
《计算机工程与应用》
CSCD
PKU Core Journal (北大核心)
2010, No. 13, pp. 139-141, 150 (4 pages in total)
Computer Engineering and Applications
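Funding
Natural Science Basic Research Program of Jiangsu Higher Education Institutions, No. 07KJD520216
Xuzhou Normal University Project Fund, No. KY200710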
Keywords
attribute correlation
attribute reduction
pruning strategy
discretization
Chi2 algorithm