Abstract
Despite their growing success in fixing real-world bugs, existing Automated Program Repair (APR) techniques are still challenged by the long-standing overfitting problem (i.e., a generated patch that passes all tests may actually be incorrect). Many approaches have been proposed for automated patch correctness assessment (APCA). However, dynamic approaches (i.e., those that need to execute tests) are time-consuming, while static approaches (i.e., those built on static code features) are less precise. Embedding techniques have therefore been proposed recently, which assess patch correctness by embedding token sequences extracted from the changed code of a generated patch. These techniques, however, rarely consider the context information and program structures of a generated patch, which existing studies have shown to be crucial for patch correctness assessment. In this study, we explore context-aware code change embedding that considers program structures for patch correctness assessment. Specifically, given a patch, we focus not only on the changed code but also on the correlated unchanged part, from which context information can be extracted and leveraged. We then use the AST path technique for representation, which captures structure information from AST nodes. Finally, based on several pre-defined heuristics, we build a deep-learning-based classifier to predict the correctness of the patch. We implemented this idea as Cache and performed extensive experiments to assess its effectiveness.
Our results demonstrate that Cache can (1) perform better than previous representation-learning-based techniques (e.g., Cache relatively outperforms existing techniques by \( \approx \)6%, \( \approx \)3%, and \( \approx \)16%, respectively, under three diverse experiment settings), and (2) achieve overall higher performance than existing APCA techniques while being even more precise than certain dynamic ones, including PATCH-SIM (92.9% vs. 83.0%). Further results reveal that the context information and program structures leveraged by Cache contribute significantly to its outstanding performance.
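The AST path idea mentioned in the abstract can be illustrated with a rough sketch. This is not Cache's implementation: it uses Python's built-in `ast` module as a stand-in parser, and the helper names `collect_leaves` and `ast_paths`, the `^` (up) / `_` (down) path notation in the style of code2vec path strings, and the `max_len` cutoff are all assumptions made here for illustration. The sketch enumerates leaf-to-leaf paths for a buggy and a patched snippet and shows that the changed operator appears on paths that also reach unchanged context tokens.

```python
import ast
from itertools import combinations


def collect_leaves(tree):
    """Return (token, root-to-leaf node list) for every terminal in the tree."""
    leaves = []

    def visit(node, trail):
        if isinstance(node, ast.expr_context):  # skip Load/Store markers
            return
        trail = trail + [node]
        if isinstance(node, ast.Name):
            token = node.id
        elif isinstance(node, ast.arg):
            token = node.arg
        elif isinstance(node, ast.Constant):
            token = repr(node.value)
        else:
            token = None
        children = list(ast.iter_child_nodes(node))
        if token is None and not children:
            token = type(node).__name__  # childless nodes, e.g. operators Gt, Add
        if token is not None:
            leaves.append((token, trail))
            return
        for child in children:
            visit(child, trail)

    visit(tree, [])
    return leaves


def ast_paths(source, max_len=8):
    """Set of (start_token, path, end_token) triples (hypothetical helper)."""
    leaves = collect_leaves(ast.parse(source))
    triples = set()
    for (tok_a, trail_a), (tok_b, trail_b) in combinations(leaves, 2):
        # Walk the shared prefix to find the lowest common ancestor.
        i = 0
        while trail_a[i] is trail_b[i]:
            i += 1
        up = [type(n).__name__ for n in reversed(trail_a[i:])]
        top = type(trail_a[i - 1]).__name__
        down = [type(n).__name__ for n in trail_b[i:]]
        if len(up) + 1 + len(down) <= max_len:
            path = "^".join(up + [top]) + "".join("_" + d for d in down)
            triples.add((tok_a, path, tok_b))
    return triples


# Toy "patch": only the comparison operator changes; the body is unchanged context.
buggy = "if x > 0:\n    total = total + x\n"
patched = "if x >= 0:\n    total = total + x\n"

removed = ast_paths(buggy) - ast_paths(patched)
added = ast_paths(patched) - ast_paths(buggy)

for triple in sorted(added):
    print(triple)
```

In this toy case every added or removed triple has the changed operator at one end, and several of them end in tokens from the untouched assignment. This is one simple way to picture how a path-based representation can tie a change to its surrounding context.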
Index Terms
- Context-Aware Code Change Embedding for Better Patch Correctness Assessment