Abstract
Text classification is a core research topic in information processing and is widely used in areas such as automatic retrieval and text filtering. This study classifies Tibetan text with a Logistic regression classifier. First, a Tibetan corpus is collected and preprocessed, and text features are selected with the information gain algorithm and extracted using Euclidean distance. Second, a Logistic regression classifier is constructed. Finally, the accuracy, recall, and F1 value of the classification are tested and analyzed, and the classification performance of the Logistic algorithm is compared with that of the Gaussian NB algorithm. The results show that the Logistic algorithm achieves the better classification performance.
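The abstract names information gain for feature selection and F1 for evaluation but gives no formulas. The following is a minimal sketch, assuming the standard textbook definition of information gain over binary term presence and the usual definition of F1 as the harmonic mean of precision and recall. The toy documents, tokens, labels, and function names are hypothetical placeholders. The sketch does not reproduce the authors' implementation, their Tibetan preprocessing, or the Euclidean-distance extraction step.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t) * H(C | t) + P(not t) * H(C | not t)].

    `docs` is a list of token sets; a term counts as present or absent only.
    """
    with_term = [c for d, c in zip(docs, labels) if term in d]
    without_term = [c for d, c in zip(docs, labels) if term not in d]
    total = len(labels)
    conditional = (
        (len(with_term) / total) * entropy(with_term)
        + (len(without_term) / total) * entropy(without_term)
    )
    return entropy(labels) - conditional


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy corpus: each document is a set of already-segmented tokens.
docs = [{"goal", "team"}, {"team", "match"}, {"bank", "loan"}, {"loan", "team"}]
labels = ["sports", "sports", "finance", "finance"]

# Rank the vocabulary by information gain; the top terms would be kept as features.
vocabulary = sorted(set().union(*docs))
ranked = sorted(vocabulary, key=lambda t: information_gain(docs, labels, t), reverse=True)
```

In this toy corpus the token "loan" separates the two classes perfectly, so it ranks first. A real pipeline would keep the top-ranked terms, train the classifier on them, and report the metrics per class on held-out data.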
Authors
Qun Nuo (群诺)
Jia Hongyun (贾宏云)
Academy of Information Science and Technology, Tibet University, Lhasa, Tibet 850000, China
Source
Information & Computer (《信息与电脑》)
2018, No. 5, pp. 70-73 (4 pages)
Funding
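Major Science and Technology Special Project of the Tibet Autonomous Region Science and Technology Plan (Project No. ZDZX2017000136)
Tibet University "Everest Scholars Talent Development Support Program" project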