There are lots of code clones appearing in software,which are similar code fragments with each other. In the past decades,researchers have proposed some state-of-the-art methods to detect clones. The code clones have ...There are lots of code clones appearing in software,which are similar code fragments with each other. In the past decades,researchers have proposed some state-of-the-art methods to detect clones. The code clones have showing some relationship with the evolution of software. In order to explore relationships between clones and their evolution,we propose a framework to cluster clones with a Fuzzy C-means clustering method.Firstly,we detect all the clones using Ni Cad,and build the clone genealogies for multiple versions software.Secondly,we extract some metrics to describe the clones and their evolution. Finally,we cluster all clone's vectors,which are generated with the different metrics for different proposes. Experimental results on six open source software packages have shown the relationships among the clone life,the number of change times,the clone pattern and et al. can help developers to understand clones.展开更多
In the recent era of software development,reusing software is one of the major activities that is widely used to save time.To reuse software,the copy and paste method is used and this whole process is known as code cl...In the recent era of software development,reusing software is one of the major activities that is widely used to save time.To reuse software,the copy and paste method is used and this whole process is known as code cloning.This activity leads to problems like difficulty in debugging,increase in time to debug and manage software code.In the literature,various algorithms have been developed to find out the clones but it takes too much time as well as more space to figure out the clones.Unfortunately,most of them are not scalable.This problem has been targeted upon in this paper.In the proposed framework,authors have proposed a new method of identifying clones that takes lesser time to find out clones as compared with many popular code clone detection algorithms.The proposed framework has also addressed one of the key issues in code clone detection i.e.,detection of near-miss(Type-3)and semantic clones(Type-4)with significant accuracy of 95.52%and 92.80%respectively.The present study is divided into two phases,the first method converts any code into an intermediate representation form i.e.,Hashinspired abstract syntax trees.In the second phase,these abstract syntax trees are passed to a novel approach“Similarity-based self-adjusting hash inspired abstract syntax tree”algorithm that helps in knowing the similarity level of codes.The proposed method has shown a lot of improvement over the existing code clones identification methods.展开更多
This article proposes the high-speed and high-accuracy code clone detection method based on the combination of tree-based and token-based methods. Existence of duplicated program codes, called code clone, is one of th...This article proposes the high-speed and high-accuracy code clone detection method based on the combination of tree-based and token-based methods. Existence of duplicated program codes, called code clone, is one of the main factors that reduces the quality and maintainability of software. If one code fragment contains faults (bugs) and they are copied and modified to other locations, it is necessary to correct all of them. But it is not easy to find all code clones in large and complex software. Much research efforts have been done for code clone detection. There are mainly two methods for code clone detection. One is token-based and the other is tree-based method. Token-based method is fast and requires less resources. However it cannot detect all kinds of code clones. Tree-based method can detect all kinds of code clones, but it is slow and requires much computing resources. In this paper combination of these two methods was proposed to improve the efficiency and accuracy of detecting code clones. Firstly some candidates of code clones will be extracted by token-based method that is fast and lightweight. Then selected candidates will be checked more precisely by using tree-based method that can find all kinds of code clones. The prototype system was developed. This system accepts source code and tokenizes it in the first step. Then token-based method is applied to this token sequence to find candidates of code clones. After extracting several candidates, selected source codes will be converted into abstract syntax tree (AST) for applying tree-based method. Some sample source codes were used to evaluate the proposed method. This evaluation proved the improvement of efficiency and precision of code clones detecting.展开更多
Code similarity analysis has become more popular due to its significant applicantions,including vulnerability detection,malware detection,and patch analysis.Since the source code of the software is difficult to obtain...Code similarity analysis has become more popular due to its significant applicantions,including vulnerability detection,malware detection,and patch analysis.Since the source code of the software is difficult to obtain under most circumstances,binary-level code similarity analysis(BCSA)has been paid much attention to.In recent years,many BCSA studies incorporating Al techniques focus on deriving semantic information from binary functions with code representations such as assembly code,intermediate representations,and control flow graphs to measure the similarity.However,due to the impacts of different compilers,architectures,and obfuscations,binaries compiled from the same source code may vary considerably,which becomes the major obstacle for these works to obtain robust features.In this paper,we propose a solution,named UPPC(Unleashing the Power of Pseudo-code),which leverages the pseudo-code of binary function as input,to address the binary code similarity analysis challenge,since pseudocode has higher abstraction and is platform-independent compared to binary instructions.UPPC selectively inlines the functions to capture the full function semantics across different compiler optimization levels and uses a deep pyramidal convolutional neural network to obtain the semantic embedding of the function.We evaluated UPPC on a data set containing vulnerabilities and a data set including different architectures(X86,ARM),different optimization options(O0-O3),different compilers(GCC,Clang),and four obfuscation strategies.The experimental results show that the accuracy of UPPC in function search is 33.2%higher than that of existing methods.展开更多
Reusing code fragments by copying and pasting them with or without minor adaptation is a common activity in software development.As a result,software systems often contain sections of code that are very similar,called...Reusing code fragments by copying and pasting them with or without minor adaptation is a common activity in software development.As a result,software systems often contain sections of code that are very similar,called code clones.Code clones are beneficial in reducing software development costs and development risks.However,recent studies have indicated some negative impacts as a result.In order to effectively manage and utilize the clones,we design an approach for recommending refactoring clones based on a Bayesian network.Firstly,clone codes are detected from the source code.Secondly,the clones that need to be refactored are identified,and the static and evolutions features are extracted to build the feature database.Finally,the Bayesian network classifier is used for training and evaluating the classification results.Based on more than 640 refactor examples of five open source software developed in C,we observe a considerable enhancement.The results show that the accuracy of the approach is larger than 90%.We believe our approach will provide a more accurate and reasonable code refactoring and maintenance advice for software developers.展开更多
Modifying a code segment may give rise to a consistency issue when the code segment belongs to a clone group comprising closely similar code segments.Recent studies have demonstrated that such consistent changes can i...Modifying a code segment may give rise to a consistency issue when the code segment belongs to a clone group comprising closely similar code segments.Recent studies have demonstrated that such consistent changes can incur extra maintenance costs when clones are checked for consistency and introduce defects if developers forget to change clones consistently when needed.To address this problem,researchers have proposed an approach to predict clone consistency in advance with handcrafted attributes,notably using machine learning methods.Although these attributes can help predict clone consistency to some extent,the capability of such an approach is generally weak and unsatisfactory in practice.Such limitations in capability are especially severe at a project's infancy stage when there is not sufficient within-project data to model clone consistency behavior,and cross-project data have not been helpful in supporting prediction.In this paper,we propose the Clone Hierarchical Attention Neural Network(CHANN)to represent code clones and their evolution by adopting a hierarchical perspective of code,context,and code evolution,and thus enhancing the effectiveness of clone consistency prediction.To assess the effectiveness of CHANN,we conduct experiments on the dataset collected from eight open-source projects.The experimental results show that CHANN is highly effective in predicting clone consistency,and the precision,recall,and F-measure attained in prediction are around 82%.These findings support our hypothesis that the hierarchical neural network can help developers predict clone consistency effectively in the case of cross-project incubation when insufficient data are available at the early stage of software development.展开更多
基金Sponsored by the National Natural Science Foundation of China(Grant No.61173021)
文摘There are lots of code clones appearing in software,which are similar code fragments with each other. In the past decades,researchers have proposed some state-of-the-art methods to detect clones. The code clones have showing some relationship with the evolution of software. In order to explore relationships between clones and their evolution,we propose a framework to cluster clones with a Fuzzy C-means clustering method.Firstly,we detect all the clones using Ni Cad,and build the clone genealogies for multiple versions software.Secondly,we extract some metrics to describe the clones and their evolution. Finally,we cluster all clone's vectors,which are generated with the different metrics for different proposes. Experimental results on six open source software packages have shown the relationships among the clone life,the number of change times,the clone pattern and et al. can help developers to understand clones.
文摘In the recent era of software development,reusing software is one of the major activities that is widely used to save time.To reuse software,the copy and paste method is used and this whole process is known as code cloning.This activity leads to problems like difficulty in debugging,increase in time to debug and manage software code.In the literature,various algorithms have been developed to find out the clones but it takes too much time as well as more space to figure out the clones.Unfortunately,most of them are not scalable.This problem has been targeted upon in this paper.In the proposed framework,authors have proposed a new method of identifying clones that takes lesser time to find out clones as compared with many popular code clone detection algorithms.The proposed framework has also addressed one of the key issues in code clone detection i.e.,detection of near-miss(Type-3)and semantic clones(Type-4)with significant accuracy of 95.52%and 92.80%respectively.The present study is divided into two phases,the first method converts any code into an intermediate representation form i.e.,Hashinspired abstract syntax trees.In the second phase,these abstract syntax trees are passed to a novel approach“Similarity-based self-adjusting hash inspired abstract syntax tree”algorithm that helps in knowing the similarity level of codes.The proposed method has shown a lot of improvement over the existing code clones identification methods.
文摘This article proposes the high-speed and high-accuracy code clone detection method based on the combination of tree-based and token-based methods. Existence of duplicated program codes, called code clone, is one of the main factors that reduces the quality and maintainability of software. If one code fragment contains faults (bugs) and they are copied and modified to other locations, it is necessary to correct all of them. But it is not easy to find all code clones in large and complex software. Much research efforts have been done for code clone detection. There are mainly two methods for code clone detection. One is token-based and the other is tree-based method. Token-based method is fast and requires less resources. However it cannot detect all kinds of code clones. Tree-based method can detect all kinds of code clones, but it is slow and requires much computing resources. In this paper combination of these two methods was proposed to improve the efficiency and accuracy of detecting code clones. Firstly some candidates of code clones will be extracted by token-based method that is fast and lightweight. Then selected candidates will be checked more precisely by using tree-based method that can find all kinds of code clones. The prototype system was developed. This system accepts source code and tokenizes it in the first step. Then token-based method is applied to this token sequence to find candidates of code clones. After extracting several candidates, selected source codes will be converted into abstract syntax tree (AST) for applying tree-based method. Some sample source codes were used to evaluate the proposed method. This evaluation proved the improvement of efficiency and precision of code clones detecting.
文摘Code similarity analysis has become more popular due to its significant applicantions,including vulnerability detection,malware detection,and patch analysis.Since the source code of the software is difficult to obtain under most circumstances,binary-level code similarity analysis(BCSA)has been paid much attention to.In recent years,many BCSA studies incorporating Al techniques focus on deriving semantic information from binary functions with code representations such as assembly code,intermediate representations,and control flow graphs to measure the similarity.However,due to the impacts of different compilers,architectures,and obfuscations,binaries compiled from the same source code may vary considerably,which becomes the major obstacle for these works to obtain robust features.In this paper,we propose a solution,named UPPC(Unleashing the Power of Pseudo-code),which leverages the pseudo-code of binary function as input,to address the binary code similarity analysis challenge,since pseudocode has higher abstraction and is platform-independent compared to binary instructions.UPPC selectively inlines the functions to capture the full function semantics across different compiler optimization levels and uses a deep pyramidal convolutional neural network to obtain the semantic embedding of the function.We evaluated UPPC on a data set containing vulnerabilities and a data set including different architectures(X86,ARM),different optimization options(O0-O3),different compilers(GCC,Clang),and four obfuscation strategies.The experimental results show that the accuracy of UPPC in function search is 33.2%higher than that of existing methods.
基金This work was supported by the National Natural Science Foundation(61363017)of China.The author is Liu,D.S.and the website is https://isisn.nsfc.gov.cn.
文摘Reusing code fragments by copying and pasting them with or without minor adaptation is a common activity in software development.As a result,software systems often contain sections of code that are very similar,called code clones.Code clones are beneficial in reducing software development costs and development risks.However,recent studies have indicated some negative impacts as a result.In order to effectively manage and utilize the clones,we design an approach for recommending refactoring clones based on a Bayesian network.Firstly,clone codes are detected from the source code.Secondly,the clones that need to be refactored are identified,and the static and evolutions features are extracted to build the feature database.Finally,the Bayesian network classifier is used for training and evaluating the classification results.Based on more than 640 refactor examples of five open source software developed in C,we observe a considerable enhancement.The results show that the accuracy of the approach is larger than 90%.We believe our approach will provide a more accurate and reasonable code refactoring and maintenance advice for software developers.
基金supported by the National Natural Science Foundation of China under Grant Nos.U20A6003 and 62237001the Guangdong Science and Technology Plan Project under Grant No.2021B1212100004+1 种基金the Guangdong Natural Science Fund Project under Grant No.2021A1515011243Guangdong Joint Fund of the National Natural Science Foundation of China underGrant Nos.U1801263and U1701262.
文摘Modifying a code segment may give rise to a consistency issue when the code segment belongs to a clone group comprising closely similar code segments.Recent studies have demonstrated that such consistent changes can incur extra maintenance costs when clones are checked for consistency and introduce defects if developers forget to change clones consistently when needed.To address this problem,researchers have proposed an approach to predict clone consistency in advance with handcrafted attributes,notably using machine learning methods.Although these attributes can help predict clone consistency to some extent,the capability of such an approach is generally weak and unsatisfactory in practice.Such limitations in capability are especially severe at a project's infancy stage when there is not sufficient within-project data to model clone consistency behavior,and cross-project data have not been helpful in supporting prediction.In this paper,we propose the Clone Hierarchical Attention Neural Network(CHANN)to represent code clones and their evolution by adopting a hierarchical perspective of code,context,and code evolution,and thus enhancing the effectiveness of clone consistency prediction.To assess the effectiveness of CHANN,we conduct experiments on the dataset collected from eight open-source projects.The experimental results show that CHANN is highly effective in predicting clone consistency,and the precision,recall,and F-measure attained in prediction are around 82%.These findings support our hypothesis that the hierarchical neural network can help developers predict clone consistency effectively in the case of cross-project incubation when insufficient data are available at the early stage of software development.