XIE Chunli, LIN Jiangxu, LIU Xiaoyang, ZHANG Wenbin, HUANG Junwei. A Source Code Similarity Approach Based on Improved Convolutional Neural Networks[J]. Applied Mathematics and Mechanics, 2019, 40(11): 1235-1245. doi: 10.21656/1000-0887.400221

A Source Code Similarity Approach Based on Improved Convolutional Neural Networks

doi: 10.21656/1000-0887.400221
Funds: The National Natural Science Foundation of China (61773185; 61877030; 61502212)
  • Received Date: 2019-07-22
  • Revision Received Date: 2019-09-23
  • Publish Date: 2019-11-01
  • Source code similarity refers to the functional similarity of different code segments and is an important research topic in software engineering. Existing methods mainly extract textual and structural features from source code by hand and compute similarity from statistical information, disregarding the semantic characteristics of the code. To address this problem, a source code similarity detection method based on a convolutional neural network (CNN) is proposed. First, the source code is represented through word embedding to obtain word vectors. Second, a CNN model is trained to learn embedded representations of source code documents. Finally, the cosine similarity of each pair of source code documents is calculated. Experiments show that the proposed method can improve performance in measuring the semantic similarity of source code.
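
The three steps named in the abstract (word embedding, a convolutional encoder, cosine similarity) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not describe the tokenizer, the embedding dimension, the filter shapes, or the "improved" CNN architecture, so the helper names (`tokenize`, `build_embeddings`, `make_filters`, `conv_max_pool`, `encode`) are hypothetical, and the embeddings and filters are random and untrained. The sketch shows only how data would flow through such a pipeline, not the authors' model or its results.

```python
import math
import random


def tokenize(code):
    """Crude tokenizer (assumption): space out punctuation, then split on whitespace."""
    for ch in "(){}[];,+-*/=<>":
        code = code.replace(ch, f" {ch} ")
    return code.split()


def build_embeddings(vocab, dim, seed=0):
    """Random vectors standing in for trained word embeddings (assumption)."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for tok in sorted(vocab)}


def make_filters(num_filters, width, dim, seed=1):
    """Random, untrained convolution filters; each spans `width` consecutive tokens."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(width * dim)]
            for _ in range(num_filters)]


def conv_max_pool(vectors, filters, width):
    """1-D convolution over the token sequence, tanh activation, max-over-time pooling."""
    dim = len(vectors[0])
    # Zero-pad sequences shorter than the filter width.
    padded = vectors + [[0.0] * dim for _ in range(max(0, width - len(vectors)))]
    features = []
    for f in filters:
        best = -1.0  # tanh is bounded below by -1
        for i in range(len(padded) - width + 1):
            window = [x for vec in padded[i:i + width] for x in vec]
            best = max(best, math.tanh(sum(w * x for w, x in zip(f, window))))
        features.append(best)
    return features


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def encode(code, embeddings, filters, width):
    """Embed each token of a non-empty snippet, then convolve and pool to a fixed-length vector."""
    vectors = [embeddings[tok] for tok in tokenize(code)]
    return conv_max_pool(vectors, filters, width)


# Two hypothetical snippets with the same function but different identifiers.
snippet_a = "int add(int a, int b) { return a + b; }"
snippet_b = "int sum(int x, int y) { return x + y; }"

vocab = set(tokenize(snippet_a)) | set(tokenize(snippet_b))
embeddings = build_embeddings(vocab, dim=8)
filters = make_filters(num_filters=16, width=3, dim=8)

vec_a = encode(snippet_a, embeddings, filters, width=3)
vec_b = encode(snippet_b, embeddings, filters, width=3)
print(cosine_similarity(vec_a, vec_b))
```

Because the weights are random, the printed score carries no meaning about functional similarity. In a trained system of the kind the abstract describes, the embeddings and filters would be learned so that functionally similar code maps to nearby vectors.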