Abstract
Traditional text classification methods are time-consuming, resource-intensive, and inefficient. To address these problems, a model is proposed that combines the big data processing platform Hadoop with Chinese text classification to parallelize the support vector machine (SVM) algorithm. Analysis of the experimental data shows that, compared with training on the sample data with a traditional single-machine SVM, the parallel SVM algorithm implemented on the Hadoop platform mitigates the long training time incurred with large numbers of training samples, and classification accuracy also improves. The advantage of the parallel SVM algorithm on Hadoop over the single-machine SVM is greatest when classifying large volumes of text.
Source
《新技术新工艺》
2017, No. 2, pp. 40-43 (4 pages)
New Technology & New Process