Abstract
Code classification is a basic task in software development and management that facilitates code reuse, comprehension, search, and maintenance. Existing supervised approaches to code classification require large amounts of labeled data, and data annotation is costly. To address this problem, this paper proposes a pre-training-based code classification method. First, the code is preprocessed by eliminating whitespace and removing low-frequency symbols. Second, a BERT-based pre-trained model (CodeBERT) is used to extract syntactic, semantic, and contextual features of the code from unlabeled samples. Finally, a code classifier is fine-tuned on a small labeled sample on top of the pre-trained model. Experimental results show that the method achieves good results even with few training epochs, and its F1 score is about 12% higher than that of the Text Convolutional Neural Network (Text-CNN) method.
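The preprocessing step described in the abstract (eliminating whitespace and removing low-frequency symbols) can be sketched roughly as follows; the `min_count` threshold and the `<UNK>` placeholder token are illustrative choices, not details taken from the paper:

```python
from collections import Counter

def preprocess(code_samples, min_count=2):
    """Tokenize code snippets on whitespace (collapsing blank runs) and
    replace tokens that occur fewer than min_count times with <UNK>.
    min_count and the <UNK> symbol are illustrative assumptions."""
    tokenized = [sample.split() for sample in code_samples]
    freq = Counter(tok for toks in tokenized for tok in toks)
    return [[t if freq[t] >= min_count else "<UNK>" for t in toks]
            for toks in tokenized]

samples = ["int a = 0 ;", "int b = a + 1 ;", "float zz = 0 ;"]
print(preprocess(samples))
```

In this toy run, common tokens such as `int`, `=`, and `;` are kept, while one-off tokens such as `b`, `+`, and `zz` are mapped to the placeholder before the sequences are passed to the tokenizer of the pre-trained model.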
Authors
LIANG Yao; HONG Qingcheng; WANG Xia; XIE Chunli (School of Computer Science and Technology, Jiangsu Normal University, Xuzhou 221116, China)
Source
Software Engineering (《软件工程》), 2023, No. 10, pp. 32-35 (4 pages)
Funding
Postgraduate Research and Practice Innovation Program of Jiangsu Province (2021XKT1392)
Jiangsu Province College Students' Innovation and Entrepreneurship Training Program (202010320035Z)
Jiangsu Province Modern Educational Technology Research Project (2022-R-102067)
Jiangsu Province Education Science "14th Five-Year Plan" Project (D/2021/01/139)
Keywords
code representation
code classification
pre-trained model