Exclusive: Theory and Application of Cyberspace Geography

Survey of vulnerability detection based on graph deep learning

  • DONG Jiping,
  • GUO Qiquan,
  • GAO Chundong,
  • HAO Mengmeng,
  • JIANG Dong
  • 1. Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
    2. Laboratory of Cyberspace Geography, Chinese Academy of Sciences and The Ministry of Public Security of the People's Republic of China, Beijing 100101, China
    3. College of Resources and Environment, University of Chinese Academy of Sciences, Beijing 100190, China

Received date: 2022-10-31

Revised date: 2022-11-19

Published online: 2023-08-11

Abstract

Recent advances in graph-based deep learning have demonstrated its strong potential for processing non-Euclidean structured data, and many studies have attempted to apply graph embeddings or graph neural networks to vulnerability detection. This survey systematically reviews vulnerability detection based on graph deep learning. First, we summarize the four main stages of the vulnerability detection process: dataset, graph data preparation, graph deep learning model construction, and result evaluation. Then, focusing on the effectiveness of graph-based deep learning for vulnerability detection, we review research results based on code patterns, code similarity, and specific application scenarios. Finally, drawing on this summary of existing work, we analyze the open challenges and anticipate future trends in this research field.

Cite this article

DONG Jiping, GUO Qiquan, GAO Chundong, HAO Mengmeng, JIANG Dong. Survey of vulnerability detection based on graph deep learning[J]. Science & Technology Review, 2023, 41(13): 41-59. DOI: 10.3981/j.issn.1000-7857.2023.13.005
