Abstract
Sparse data in bidding documents makes feature extraction difficult and lowers classification accuracy. To address this, a classification method is proposed that combines eXtreme gradient boosting (XGBoost) with a focused text representation model. The focused representation component extracts the key fields that significantly influence the classification result, segments them with N-Gram, and applies a part-of-speech-level term frequency-inverse document frequency (TF-IDF) method to produce feature vectors for the bidding documents. The classification component feeds the extracted features into an XGBoost model, which classifies bidding documents by industry and by project type. Experimental results show that the focused representation model extracts features more effectively than the count-vector and TF-IDF text representation models. Validation on a manually annotated corpus shows a classification accuracy of 95.3% across 8 industries and about 96.6% by project type. Compared with other classification algorithms, XGBoost performs better.
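The focused representation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the field names, sample documents, and helper functions are hypothetical, plain smoothed TF-IDF stands in for the part-of-speech-level variant, and word n-grams over whitespace-split English text stand in for N-Gram segmentation of Chinese text.

```python
import math
from collections import Counter

# Hypothetical key fields assumed to carry the classification signal.
KEY_FIELDS = ("project_name", "procurement_content")


def ngrams(units, n_values=(1, 2)):
    """Return all n-grams (joined by spaces) of the requested sizes."""
    grams = []
    for n in n_values:
        grams.extend(" ".join(units[i:i + n]) for i in range(len(units) - n + 1))
    return grams


def focused_tokens(doc, key_fields=KEY_FIELDS):
    """Tokenize only the key fields; every other field is ignored."""
    tokens = []
    for field in key_fields:
        tokens.extend(ngrams(doc.get(field, "").split()))
    return tokens


def tfidf_vectors(token_lists):
    """Build dense TF-IDF vectors with a smoothed IDF term."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


# Made-up sample documents, for illustration only.
docs = [
    {"project_name": "coal mine equipment procurement",
     "procurement_content": "hydraulic support equipment",
     "boilerplate": "instructions to bidders"},
    {"project_name": "railway station construction",
     "procurement_content": "track laying construction",
     "boilerplate": "instructions to bidders"},
]

vocab, vectors = tfidf_vectors([focused_tokens(d) for d in docs])
```

Each row of `vectors` could then be passed, together with its industry or project-type label, to a gradient-boosted classifier such as `xgboost.XGBClassifier`; that step is left out here to keep the sketch dependency-free.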
Authors
闫吉庆
沈志远
吕靖
刘金硕
YAN Jiqing; SHEN Zhiyuan; LÜ Jing; LIU Jinshuo (China Shenhua International Engineering Company, Beijing 100007, China; School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China)
Source
《武汉大学学报(工学版)》
CAS
CSCD
PKU Core Journals
2022, No. 3, pp. 310-318 (9 pages)
Engineering Journal of Wuhan University
Keywords
text classification
text representation
eXtreme gradient boosting (XGBoost)
focus model