Code-Aware Prompting: A Study of Coverage Guided Test Generation in Regression Setting Using LLM
Proc. ACM Softw. Eng. (2024)
Columbia University | AWS AI Labs
The authors are Gabriel Ryan, Siddhartha Jain (previously at Amazon AWS AI, where he worked on unit test generation for CodeWhisperer), Mingyue Shang, Shiqi Wang, Xiaofei Ma, Murali Krishna Ramanathan, and Baishakhi Ray, affiliated with Columbia University and AWS AI Labs.
1. Introduction
- Testing plays a crucial role in ensuring software quality, but traditional search-based software testing (SBST) methods often struggle to achieve desirable test coverage when dealing with complex software units.
- Recent work on test generation with large language models (LLMs) has focused on improving generation quality by optimizing the test generation context and repairing errors in the model's output, but these approaches rely on fixed prompting strategies that ask the model to produce tests without additional guidance.
- As a result, LLM-generated test suites still suffer from low coverage issues.
- This paper proposes a novel method named SymPrompt, which constructs prompts using code properties to enable LLMs to generate test inputs for more complex focus methods.
- The workflow of SymPrompt includes three stages:
- Collect approximate path constraints and return values.
- Gather properties of the focus method, including parameter types, external library dependencies, code context, etc.
- Use the collected information to construct a prompt for each execution path and ask the LLM to generate a test input that exercises the corresponding path (a minimal prompt-assembly sketch follows this section's list).
- SymPrompt is implemented for Python using the tree-sitter parsing framework and evaluated on 897 focus methods that are challenging for existing SBST test generation.
- SymPrompt improves the proportion of correctly generated tests by a factor of 5 on CodeGen2 and increases relative coverage by 26%.
- To assess generalization, SymPrompt was also evaluated with the state-of-the-art LLM GPT-4, where it increases relative coverage by 105%.
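As a rough illustration of the third stage, the snippet below sketches how a per-path prompt might be assembled from the collected constraints and context. The template wording and the helper's signature are assumptions for illustration, not the paper's exact prompt format.

```python
# Minimal sketch of per-path prompt assembly (stage 3). The template wording
# and this helper's signature are illustrative assumptions, not the paper's
# exact prompt format.
def build_path_prompt(focus_method_src: str, type_context: str,
                      dependency_context: str, path_constraints: list[str],
                      return_expr: str) -> str:
    """Combine the focus method, its context, and one path's constraints into a prompt."""
    condition = " and ".join(path_constraints) if path_constraints else "no branch condition"
    return (
        f"{dependency_context}\n"
        f"{type_context}\n"
        f"{focus_method_src}\n"
        f"# Write a pytest unit test whose inputs satisfy: {condition},\n"
        f"# so that the focus method returns {return_expr}.\n"
        f"def test_"
    )
```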
2. Working Example
- This paper provides a working example that illustrates how path constraint prompts are constructed and used to guide an LLM toward high-coverage tests in a regression setting.
- The example demonstrates the limitations of SBST and LLM-based test generation and explains how SymPrompt overcomes these limitations.
3. Method
- This paper details the workflow of SymPrompt, which includes three steps:
- Step I: Collect Approximate Path Constraints: Collect path constraints by statically traversing the abstract syntax tree (AST) of the focus method (a minimal sketch follows this list).
- Step II: Context Construction: Gather the type context and dependency context of the focus method to help the model generate correct test cases.
- Step III: Test Generation: Construct prompts using the collected path constraints and context and use the LLM to generate test cases.
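The sketch below illustrates Step I under simplifying assumptions: it uses Python's built-in `ast` module rather than the tree-sitter framework the paper builds on, and it only handles `if`/`else` branches and `return` statements.

```python
# Simplified path-constraint collection: enumerate paths through if/else and
# return statements, recording branch conditions and the returned expression.
import ast

def collect_paths(stmts, constraints=()):
    """Yield (branch constraints, return expression) for each execution path."""
    for i, stmt in enumerate(stmts):
        if isinstance(stmt, ast.Return):
            yield constraints, ast.unparse(stmt.value) if stmt.value else "None"
            return
        if isinstance(stmt, ast.If):
            cond = ast.unparse(stmt.test)
            # Path where the condition holds: take the if-body, then the tail.
            yield from collect_paths(stmt.body + stmts[i + 1:], constraints + (cond,))
            # Path where it does not: take the else-body (possibly empty), then the tail.
            yield from collect_paths(stmt.orelse + stmts[i + 1:], constraints + (f"not ({cond})",))
            return
    yield constraints, "None"  # implicit return at the end of the method

source = '''
def classify(x):
    if x < 0:
        return "negative"
    if x == 0:
        return "zero"
    return "positive"
'''
focus_method = ast.parse(source).body[0]
for path_constraints, return_expr in collect_paths(focus_method.body):
    print(path_constraints, "->", return_expr)
# ('x < 0',) -> 'negative'
# ('not (x < 0)', 'x == 0') -> 'zero'
# ('not (x < 0)', 'not (x == 0)') -> 'positive'
```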
4. Evaluation
- This paper evaluates SymPrompt through the following research questions:
- RQ1: How does SymPrompt compare with simpler test generation prompting approaches in a regression setting?
- RQ2: Does SymPrompt generalize to focus methods unseen in the training data?
- RQ3: How do the design choices of SymPrompt affect its performance?
- RQ4: Does SymPrompt improve the performance of large models in generating tests?
- SymPrompt was evaluated using CodeGen2 and GPT-4, and it was found to significantly improve test generation performance in all cases.
5. Threats and Discussion
- This paper discusses threats to validity in the evaluation, including model and benchmark selection, training-data memorization, and the choice of evaluation metrics.
- The paper also discusses the limitations of test generation in regression settings.
6. Related Work
- This paper is related to the following fields: search-based software testing, symbolic test generation, test generation using LLMs, and hybrid LLM-SBST test generation.
7. Conclusion
- This paper introduces SymPrompt, a new method that generates test cases by breaking down the test suite generation process into structured, code-aware prompt sequences.
- SymPrompt significantly improves test suite generation of CodeGen2 and achieves substantial improvements in the proportion of correct test generation and coverage.
- The paper also shows that when given explicit instructions to analyze execution path constraints, GPT-4 can generate its own path constraint prompts, which further increases the coverage of its generated test suites.
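An instruction of the kind described above might look like the following; the wording is a paraphrase for illustration, not the prompt used in the paper.

```python
# Illustrative instruction (paraphrased; not the paper's exact prompt) that asks
# GPT-4 to derive path constraints itself before writing tests.
instruction = (
    "First, list the execution paths of the method below and the constraints "
    "its inputs must satisfy to follow each path. Then write one pytest test "
    "per path whose inputs satisfy those constraints."
)
```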
Q: What specific research methods were used in the paper?
- Path Constraint Prompting: A novel code-aware prompting strategy that decomposes the test suite generation process into multiple stages, generating test inputs for each execution path.
- Static Symbolic Analysis: Used to collect approximate path constraints and return expressions, guiding the LLM to generate test cases.
- Path Minimization: Reduces the number of prompted paths by keeping only paths with unique branch conditions, thus lowering computational cost (sketched, together with iterative prompting, after this list).
- Context Construction: Extracts the types and dependencies of the focus method, providing the LLM with a richer context.
- Iterative Prompting: Adds generated test cases to subsequent prompts to gradually increase test coverage.
- Experimental Evaluation: Assesses the performance of SymPrompt using CodeGen2 and GPT-4 on 897 hard-to-test focus methods and compares it with baseline methods.
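The two steps above that shape the prompting loop, path minimization and iterative prompting, could look roughly like the sketch below. Here `llm_generate` is a hypothetical stand-in for the model call, and `prompt_prefix` is assumed to already carry the dependency/type context and the focus method source; this is a sketch of the idea, not the paper's implementation.

```python
# Sketch of path minimization and iterative prompting under the assumptions
# stated above; helper names are hypothetical stand-ins.
from typing import Callable

Path = tuple[tuple[str, ...], str]  # (branch constraints, return expression)

def minimize_paths(paths: list[Path]) -> list[Path]:
    """Keep only paths that contribute at least one branch condition not seen so far."""
    seen: set[str] = set()
    kept = []
    for constraints, return_expr in paths:
        if not constraints or set(constraints) - seen:
            kept.append((constraints, return_expr))
            seen.update(constraints)
    return kept

def generate_suite(paths: list[Path], prompt_prefix: str,
                   llm_generate: Callable[[str], str]) -> list[str]:
    """Prompt once per minimized path, feeding earlier tests back into later prompts."""
    tests: list[str] = []
    for constraints, return_expr in minimize_paths(paths):
        prompt = (
            prompt_prefix
            + "\n\n".join(tests)
            + f"\n# Inputs should satisfy: {' and '.join(constraints) or 'the default path'}"
            + f" (expected return: {return_expr})\ndef test_"
        )
        candidate = "def test_" + llm_generate(prompt)
        # In the full system, only candidates that parse, run, and pass are appended.
        tests.append(candidate)
    return tests
```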
Q: What are the main research findings and outcomes?
- SymPrompt significantly improves test generation performance: compared to baseline prompts, it increases the proportion of correctly generated test cases by a factor of 5 and relative coverage by 26%.
- SymPrompt also benefits large models: on GPT-4, it increases relative test coverage by 105%.
- Path constraint prompting is effective: It guides the LLM to generate test cases with higher coverage and fewer errors.
- Context construction is beneficial: Type and dependency contexts help the LLM generate more accurate test cases.
- SymPrompt generalizes: on projects unseen during training, it still generates more correct, higher-coverage test cases.
Q: What are the current limitations of this research?
- Model and Benchmark Limitations: The research mainly focuses on Python projects and two LLMs, CodeGen2 and GPT-4, and the results may not be applicable to other languages or models.
- Memorization: Although the AmIInTheStack tool is used to mitigate training-data memorization, there is still a risk that CodeGen2 has seen some focus methods or similar code during training.
- Metric Limitations: The evaluation metrics may not fully reflect the complexity and practicality of the test cases.
- Regression Testing Limitations: SymPrompt assumes that the implementation of the focus method is correct, so the generated test cases may not be able to detect implementation errors.
