Abstract: Code similarity analysis has become more popular due to its significant applications, including vulnerability detection, malware detection, and patch analysis. Since the source code of software is difficult to obtain under most circumstances, binary-level code similarity analysis (BCSA) has received much attention. In recent years, many BCSA studies incorporating AI techniques have focused on deriving semantic information from binary functions, using code representations such as assembly code, intermediate representations, and control flow graphs to measure similarity. However, due to the impact of different compilers, architectures, and obfuscations, binaries compiled from the same source code may vary considerably, which is the major obstacle to obtaining robust features in these works. In this paper, we propose a solution named UPPC (Unleashing the Power of Pseudo-code), which takes the pseudo-code of a binary function as input to address the binary code similarity analysis challenge, since pseudo-code has a higher level of abstraction than binary instructions and is platform-independent. UPPC selectively inlines functions to capture the full function semantics across different compiler optimization levels and uses a deep pyramidal convolutional neural network to obtain the semantic embedding of the function. We evaluated UPPC on a data set containing vulnerabilities and a data set including different architectures (X86, ARM), different optimization options (O0-O3), different compilers (GCC, Clang), and four obfuscation strategies. The experimental results show that the accuracy of UPPC in function search is 33.2% higher than that of existing methods.
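The function search task mentioned in this abstract can be illustrated with a minimal sketch. It assumes each function has already been mapped to a fixed-length embedding vector; the vectors, the function names, and the use of cosine similarity as the ranking score are illustrative assumptions, not details taken from the UPPC paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_functions(query, corpus, top_k=3):
    """Rank the functions in `corpus` by similarity to the `query` embedding."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings; a real system would obtain them from a trained model.
corpus = {
    "memcpy_arm_O2": [0.9, 0.1, 0.0],
    "memcpy_x86_O0": [0.8, 0.2, 0.1],
    "parse_header_x86_O3": [0.0, 0.9, 0.4],
}
query = [1.0, 0.1, 0.0]
results = search_functions(query, corpus, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In this toy corpus the two variants of the same function rank above the unrelated one, which is the behaviour a robust embedding is meant to produce across architectures and optimization levels.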
Funding: This research was supported in part by the Japan Society for the Promotion of Science KAKENHI Number 22H03591 and the MEXT "Innovation Platform for Society 5.0" Program Grant Number JPMXP0518071489.
Abstract: Ethereum smart contracts are computer programs that are deployed and executed on the Ethereum blockchain to enforce agreements among untrusting parties. Being the most prominent platform that supports smart contracts, Ethereum has been targeted by many attacks and plagued by security incidents. Consequently, many smart contract vulnerabilities have been discovered in the past decade. To detect and prevent such vulnerabilities, various security analysis tools, including static and dynamic analysis tools, have been created, but their performance decreases drastically when the code to be analyzed is constantly being rewritten. In this paper, we propose Eth2Vec, a machine-learning-based static analysis tool that detects smart contract vulnerabilities. Eth2Vec remains robust against code rewrites; i.e., it can detect vulnerabilities even in rewritten code. Other machine-learning-based static analysis tools require features, which analysts create manually, as inputs. In contrast, Eth2Vec uses a neural network for language processing to automatically learn the features of vulnerable contracts. In doing so, Eth2Vec can detect vulnerabilities in smart contracts by comparing the similarity between the code of a target contract and that of the learned contracts. We performed experiments with existing open databases, such as Etherscan, and Eth2Vec outperformed a recent model based on a support vector machine in terms of well-known metrics, i.e., precision, recall, and F1-score.
Funding: This work was supported by the National Natural Science Foundation of China (Grant Nos. 61202092 and 61173021), the Research Fund for the Doctoral Program of Higher Education of China (20112302120052), the Research Fund for the Innovative Scholars of Harbin (RC2013QN010001), and the Young Colleger Academic Backbone Project of Heilongjiang.
Abstract: Traditional similar code detection approaches are limited in detecting semantically similar code, which impedes their application in practice. In this paper, we improve the traditional metrics-based approach as well as the graph-based approach and present a combined metrics-based and graph-based approach. First, source code is represented as augmented system dependence graphs. Then, metrics-based candidate similar code extraction is performed to filter out most of the dissimilar code pairs so as to lower the computational complexity. After that, code normalization is performed on the candidate similar code to remove code variations so that similar code can be detected at the semantic level. Finally, program matching is performed on the normalized control dependence trees to output semantically similar code. Experimental results show that our approach can detect similar code with code variations, and that it can be applied to large software.
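The metrics-based filtering step described in this abstract can be sketched as follows. This is a minimal illustration, assuming each code fragment is summarized by a small tuple of counts (statements, calls, branches) and that pairs are kept when the sum of absolute differences is within a threshold; the metric set, the distance, and the threshold are assumptions for illustration and are not taken from the paper.

```python
from itertools import combinations


def metric_distance(m1, m2):
    """Sum of absolute differences between two metric tuples."""
    return sum(abs(a - b) for a, b in zip(m1, m2))


def candidate_pairs(metrics, max_distance):
    """Keep only fragment pairs whose metric tuples are close.

    Pairs that fail this cheap test are never passed on to the
    expensive graph or tree matching stage.
    """
    return [
        (f1, f2)
        for f1, f2 in combinations(sorted(metrics), 2)
        if metric_distance(metrics[f1], metrics[f2]) <= max_distance
    ]


# Hypothetical fragments: (statement count, call count, branch count).
metrics = {
    "sum_loop_a": (6, 1, 0),
    "sum_loop_b": (7, 1, 0),
    "parse_args": (25, 6, 9),
}
pairs = candidate_pairs(metrics, max_distance=2)
print(pairs)
```

The point of the design is cost: comparing metric tuples is linear in the number of metrics, so most dissimilar pairs are discarded before any graph-level matching is attempted.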
Funding: This work was (partially) supported by the Key-Area Research and Development Program of Guangdong Province of China under Grant No. 2020B010164002, the National Natural Science Foundation of China under Grant Nos. 61902441, 61722214 and 61976061, the China Postdoctoral Science Foundation under Grant No. 2018M640855, the Fundamental Research Funds for the Central Universities of China under Grant Nos. 20wkpy06 and 20lgpy129, and the Opening Project of Guangdong Key Laboratory of Big Data Analysis and Processing under Grant No. 202003.
Abstract: Commit messages are important complementary information for understanding code changes. To address message scarcity, several approaches have been proposed for automatically generating commit messages. However, most of these approaches focus on generating superficial summaries of the changed software entities, without considering the intent behind the code changes (e.g., the existing approaches cannot generate a message such as "fixing 'null' pointer exception"). Considering that developers often describe the intent behind the code change when writing messages, we propose ChangeDoc, an approach that reuses existing messages in version control systems for automatic commit message generation. Our approach combines syntax, semantic, pre-syntax, and pre-semantic similarities. For a given commit without a message, it discovers the most similar past commit in a large commit repository and recommends that commit's message as the message of the given commit. Our repository contains half a million commits collected from SourceForge. We evaluate our approach on the commits from 10 projects. The results show that 21.5% of the messages recommended by ChangeDoc can be used directly without modification, and 62.8% require minor modifications. To evaluate the quality of the commit messages recommended by ChangeDoc, we performed two empirical studies involving a total of 40 participants (10 professional developers and 30 students). The results indicate that the recommended messages are very good approximations of the ones written by developers and often include important intent information that is not included in the messages generated by other tools.
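The retrieve-and-reuse idea in this abstract can be illustrated with a minimal sketch. It swaps in plain Jaccard similarity over diff tokens in place of the four similarity measures named in the abstract, and the commit records, field names, and tokenization are hypothetical. It also assumes the history is non-empty.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections (0.0 if both are empty)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def recommend_message(new_diff_tokens, history):
    """Return the message of the past commit most similar to the new diff."""
    best = max(history, key=lambda commit: jaccard(new_diff_tokens, commit["tokens"]))
    return best["message"]


# Hypothetical commit history; a real repository would hold many more entries.
history = [
    {
        "tokens": ["if", "user", "==", "null", "return"],
        "message": "Fix null pointer exception in user lookup",
    },
    {
        "tokens": ["log", "debug", "timing"],
        "message": "Add debug timing logs",
    },
]
new_diff = ["if", "order", "==", "null", "return"]
message = recommend_message(new_diff, history)
print(message)
```

The recommended message would still need a small edit here (the entity name differs), which mirrors the abstract's distinction between messages usable as-is and messages needing minor modification.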