Abstract
Traditional text classification algorithms rely on statistical methods such as expected cross entropy, information gain, and mutual information, and obtain the feature set by setting thresholds. When the training set is large, these methods are prone to defects such as ambiguous feature terms and loss of feature information. To address these problems, this paper uses the sparse autoencoder algorithm from "deep learning" to extract text features automatically, and then combines it with a deep belief network to form the SD algorithm for text classification. Experiments show that with a small training set, the SD algorithm performs worse than a traditional support vector machine; when handling high-dimensional data, however, SD achieves higher accuracy and recall than the support vector machine.
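The feature-extraction step named in the abstract can be illustrated with a minimal sketch of a single-hidden-layer sparse autoencoder. The abstract gives no architecture or hyperparameters, so everything below is an assumption: the KL-divergence sparsity penalty (a common formulation), the function names `train_sparse_autoencoder` and `encode`, the values of `rho`, `beta`, `lam`, and `lr`, and the toy term-presence matrix. The deep belief network stage of the SD algorithm is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sparse_autoencoder(X, n_hidden, rho=0.05, beta=3.0, lam=1e-4,
                             lr=0.5, epochs=200, seed=0):
    """Train a one-hidden-layer sparse autoencoder by batch gradient descent.

    X: (n_samples, n_features) array with values in [0, 1].
    rho, beta, lam, lr are illustrative values, not taken from the paper.
    Returns the encoder parameters (W1, b1).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, size=(d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, size=(n_hidden, d))
    b2 = np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)                 # hidden activations
        Y = sigmoid(H @ W2 + b2)                 # reconstruction of X
        # Mean activation of each hidden unit, clipped to avoid division by zero
        rho_hat = np.clip(H.mean(axis=0), 1e-6, 1 - 1e-6)
        # Gradient of the squared reconstruction error at the output layer
        delta2 = (Y - X) * Y * (1 - Y)
        # Gradient of the KL sparsity penalty w.r.t. the mean activations
        sparse_grad = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
        delta1 = (delta2 @ W2.T + sparse_grad) * H * (1 - H)
        # Gradient steps with L2 weight decay on the weight matrices
        W2 -= lr * (H.T @ delta2 / n + lam * W2)
        b2 -= lr * delta2.mean(axis=0)
        W1 -= lr * (X.T @ delta1 / n + lam * W1)
        b1 -= lr * delta1.mean(axis=0)
    return W1, b1

def encode(X, W1, b1):
    """Map documents to the learned hidden-layer features."""
    return sigmoid(X @ W1 + b1)

# Toy binary term-presence matrix: 6 documents, 8 terms (made-up data)
X_toy = np.array([
    [1, 1, 0, 0, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 0, 1],
], dtype=float)

W1, b1 = train_sparse_autoencoder(X_toy, n_hidden=3)
features = encode(X_toy, W1, b1)   # one 3-dimensional feature vector per document
```

In a pipeline of the kind the abstract describes, `features` would replace the threshold-selected term set as the input to the classifier; here it only shows the shape of the learned representation.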
Source
Science Technology and Engineering (《科学技术与工程》)
Peking University Core Journal (北大核心)
2013, Issue 31, pp. 9422-9426 (5 pages)
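Funding
Supported by the project "Research on the Integration of Industrialization and Informatization in Underdeveloped Regions and Its System Dynamics Mechanism" (11FJL007)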