Abstract
To address problems such as data fragmentation and subtree replication that easily arise when building decision trees, a method is proposed for constructing a characteristic (feature) data set based on the fractal dimension. Since information gain reflects how much information an attribute carries, once the embedding dimension k of the data set has been determined, the k attributes with the largest information gain are selected to form the characteristic data set of the original data. The paper also analyzes how deleting redundant attributes according to the fractal dimension and information gain, and the resulting information loss of the characteristic set, affect the construction of the decision tree. In the experiments, characteristic data sets are built in two ways, by selecting attributes directly from the original data set and by fitting, and a comparative analysis of the results on the test data further confirms the effectiveness of the method.
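As a rough illustration of the selection pipeline described in the abstract, the following Python sketch estimates an embedding dimension with a simple Grassberger-Procaccia correlation-dimension calculation, ranks attributes by information gain, keeps the top k of them, and trains a decision tree on the reduced data. The dimension estimator, the equal-width binning used for information gain, and the Iris data set are illustrative assumptions, not details taken from the paper.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Minimal sketch of the idea in the abstract; the dimension estimator,
# the binning, and the Iris data are illustrative assumptions, not the
# paper's actual implementation.

def correlation_dimension(X, n_radii=10):
    """Rough Grassberger-Procaccia estimate: slope of log C(r) versus log r."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    d = dists[np.triu_indices(len(X), k=1)]          # pairwise distances
    radii = np.logspace(np.log10(d[d > 0].min()),
                        np.log10(d.max()), n_radii)
    C = np.array([(d <= r).mean() for r in radii])   # correlation integral
    slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
    return slope

def entropy(labels):
    """Shannon entropy of a discrete label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels, bins=10):
    """Information gain of one continuous feature after equal-width binning."""
    binned = np.digitize(feature, np.histogram_bin_edges(feature, bins=bins))
    cond = sum((binned == v).mean() * entropy(labels[binned == v])
               for v in np.unique(binned))
    return entropy(labels) - cond

X, y = load_iris(return_X_y=True)

# Take k as the (rounded-up) estimated embedding dimension of the data set.
k = min(int(np.ceil(correlation_dimension(X))), X.shape[1])

# Keep the k attributes with the largest information gain.
gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
selected = np.argsort(gains)[::-1][:k]

tree = DecisionTreeClassifier(random_state=0).fit(X[:, selected], y)
print("embedding dimension k:", k)
print("selected attribute indices:", selected)
print("training accuracy:", tree.score(X[:, selected], y))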
Source
《计算机技术与发展》 (Computer Technology and Development)
2009, No. 12, pp. 5-8, 12 (5 pages)
Funding
National High-tech Research and Development Program of China (863/2007AA01Z448)
Jiangsu Provincial Social Science Foundation (08TQB007)
Keywords
decision tree
fractal dimension
information gain
data mining