Abstract
To address the traditional chi-square statistic method's insufficient consideration of the frequency and category distribution of feature items, a chi-square statistic feature selection method combined with cosine similarity is proposed. The method first represents feature items by their mean term frequency-inverse document frequency (TF-IDF) and introduces an adjustment formula to balance the number of feature items selected across categories, thereby correcting the traditional chi-square statistic method. Cosine similarity is then used to further eliminate noisy texts. Experiments were conducted on a collected Uyghur data set. The results show that the improved chi-square statistic method has good robustness and that its classification performance is superior to that of the traditional chi-square statistic method.
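For orientation, the two standard building blocks named in the abstract can be sketched as follows. This is a minimal illustration of the traditional chi-square statistic for a term/category contingency table and of plain cosine similarity. It is not the improved method: the mean TF-IDF weighting and the adjustment formula are not specified in the abstract and are deliberately left out. The function names and the count arguments `a`, `b`, `c`, `d` are assumptions made for this example.

```python
import math


def chi_square(a, b, c, d):
    """Traditional chi-square score of a term for one category.

    Hypothetical argument names for the 2x2 document counts:
    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table carries no evidence of association.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        # Treat an all-zero vector as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)
```

A term that occurs equally often inside and outside a category scores zero under `chi_square`, while a term confined to one category scores high. In a pipeline like the one described, a document whose vector has low cosine similarity to its class could be treated as noise, although the threshold and the reference vector are choices the abstract does not state.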
Source
《河南科技大学学报(自然科学版)》
CAS
Peking University Core Journal (北大核心)
2016, No. 3, pp. 42-46, 6-7 (5 pages in total)
Journal of Henan University of Science and Technology: Natural Science
Funding
National Natural Science Foundation of China (61163026, 60865001)