Paper Review — Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step (ACL 2024)

Debugging programs is essential yet challenging, even for advanced Large Language Models (LLMs). In their ACL 2024 paper, “Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step,” the researchers present LDB, a framework that teaches LLMs to debug like human developers.

Background

Debugging is essential in software development, ensuring programs run correctly and efficiently. As Large Language Models (LLMs) emerge as programming aids, their ability to write code improves, but debugging remains a challenging task, particularly for runtime and logical errors. LDB addresses these challenges by building on several foundational concepts:

Execution and Verdicts

Program execution generates runtime insights, such as intermediate variable values and results, which reveal errors that static analysis might miss. Verdicts, such as whether a test case passes or fails, act as crucial feedback, guiding both developers and LLMs in identifying whether a program aligns with its intended functionality. By incorporating execution data, LDB enables LLMs to validate correctness at each step of execution, refining programs iteratively.
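As a simple illustration of why runtime feedback matters (a toy example, not one from the paper), executing a candidate function on a visible test produces both an intermediate value and a pass/fail verdict that reading the code statically would not:

def median(xs):
    xs = sorted(xs)
    return xs[len(xs) // 2]              # bug: ignores the even-length case

result = median([1, 2, 3, 4])            # intermediate value observed at runtime: 3
verdict = (result == 2.5)                # the visible test expects 2.5
print(result, "PASS" if verdict else "FAIL")   # prints: 3 FAIL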

Control Flow and Debugging Granularity

Control Flow Graphs (CFGs) represent the logical flow of a program, with nodes as code blocks and edges as transitions. These graphs help analyze program behavior, detect loops, and trace variables. LDB leverages CFG principles by segmenting programs into basic blocks, enabling targeted debugging for individual sections. This modular approach mirrors how developers debug complex logic, focusing on isolated parts of a program before addressing the whole.

Improving Code Generation with Pass@1

Pass@1 accuracy measures the probability of an LLM generating a correct solution on its first attempt. While useful for gauging model performance, it highlights the need for robust debugging frameworks like LDB. By iteratively identifying and resolving issues, LDB improves success rates beyond the initial generation, supporting more reliable outcomes.
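For reference, the unbiased pass@k estimator commonly used for these benchmarks can be computed as follows; pass@1 is simply the case k = 1:

import math

def pass_at_k(n, c, k):
    # n: samples generated per problem, c: samples passing all tests, k: attempt budget.
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(n=10, c=3, k=1))   # ~0.3; equals c / n when k = 1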

vLLM for Scalable Debugging

vLLM is a high-throughput, memory-efficient inference and serving engine for LLMs; LDB uses it to serve models at scale. By processing inputs efficiently, it enables responsive debugging, allowing LDB to deliver iterative improvements without significant delays.
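As an illustration, here is a minimal sketch of querying a model through vLLM's offline Python API; the model name and prompt are placeholders, not the configuration used in LDB:

from vllm import LLM, SamplingParams

# Load a model into vLLM's engine (the model name is illustrative).
llm = LLM(model="codellama/CodeLlama-7b-Instruct-hf")
params = SamplingParams(temperature=0.0, max_tokens=512)

# In LDB, the prompt would contain the buggy code plus its runtime trace.
outputs = llm.generate(["# Fix the bug in the following function:\n..."], params)
print(outputs[0].outputs[0].text)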


Key Features of LDB

LDB makes debugging of LLM-generated code more efficient and precise by combining runtime execution insights with step-by-step validation.

Here’s how LDB tackles the debugging process:

1. Breaking Code into Basic Blocks

LDB segments a program into basic blocks, smaller chunks of code that represent individual steps in the logic. This modular approach mirrors how developers debug complex logic, focusing on one part at a time before addressing the entire program. By validating each block iteratively, LDB ensures that errors are identified and resolved at their root.
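As a coarse toy illustration (not from the paper), a short function divides into basic blocks at its control-flow boundaries:

def sum_of_evens(nums):
    # Block 1: initialization
    total = 0
    # Block 2: loop header, which decides whether to iterate again or fall through
    for n in nums:
        # Block 3: loop body, which conditionally updates the accumulator
        if n % 2 == 0:
            total += n
    # Block 4: exit block returning the result
    return total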

2. Tracking Intermediate States

LDB captures runtime execution data, such as variable values and execution paths, after each block. These intermediate states provide insights that static analysis often misses, such as identifying edge cases or misaligned logic. By analyzing these states, LDB enables precise localization of bugs.
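A minimal sketch of capturing such intermediate states in plain Python, using the standard sys.settrace hook (illustrative only; this is not LDB's actual instrumentation):

import sys

def trace_locals(frame, event, arg):
    # Print the local variable values at every executed line of the target function.
    if event == "line" and frame.f_code.co_name == "buggy_sum":
        print(frame.f_lineno, dict(frame.f_locals))
    return trace_locals

def buggy_sum(nums):
    total = 1          # bug: accumulator should start at 0
    for n in nums:
        total += n
    return total

sys.settrace(trace_locals)
buggy_sum([1, 2, 3])   # the printed states expose the wrong initial value
sys.settrace(None)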

3. Human-Inspired Debugging Granularity

Using Control Flow Graphs (CFGs), LDB visualizes the logical flow of a program. Nodes represent basic blocks, and edges show transitions. This allows LDB to focus debugging efforts on isolated segments while maintaining a broader understanding of the program’s structure — similar to how developers debug step-by-step.
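Conceptually, such a CFG can be written down as a mapping from each basic block to its possible successors; a hand-rolled sketch for the toy function above might look like this (illustrative, not the paper's implementation):

# Nodes are basic blocks; edges are possible control transfers.
cfg = {
    "B1_init":        ["B2_loop_header"],
    "B2_loop_header": ["B3_loop_body", "B4_return"],  # iterate again, or exit the loop
    "B3_loop_body":   ["B2_loop_header"],
    "B4_return":      [],
}

for block, successors in cfg.items():
    print(f"{block} -> {successors}")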

4. Iterative Refinement

LDB adopts an iterative approach:

  1. Profile the program by executing it and capturing failures.
  2. Debug the program using intermediate states and test case results.
  3. Regenerate code to fix errors and repeat the process until the program passes all tests.

This ensures a systematic refinement process, improving accuracy with each iteration; a simplified sketch of the loop follows.
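In the sketch below, profile, debug_blocks, and regenerate are hypothetical placeholders for LDB's executor, block-level verdict, and regeneration steps, not functions from the repository:

def ldb_loop(program, visible_tests, max_iters=10):
    # Hypothetical helpers: profile runs the tests and collects an execution trace,
    # debug_blocks produces per-block verdicts, regenerate asks the LLM for a fix.
    for _ in range(max_iters):
        passing, failed_tests, trace = profile(program, visible_tests)
        if passing:
            return program                                   # all visible tests pass
        verdicts = debug_blocks(program, trace, failed_tests)
        program = regenerate(program, verdicts, failed_tests)
    return program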

5. Scalability with vLLM

To handle the computational demands of debugging, LDB integrates with vLLM, a high-throughput, memory-efficient serving engine. This keeps debugging responsive, allowing the framework to operate seamlessly on large datasets.


Experiments

LDB has been evaluated on several benchmarks, including HumanEval, MBPP (Mostly Basic Programming Problems), and TransCoder.

Here’s what makes LDB stand out:

1. Problem Formulation

Each task involves generating and refining code based on a problem description, visible test cases (to guide debugging), and hidden test cases (for final evaluation). LDB focuses on using runtime execution feedback to identify and resolve errors effectively.
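As an illustration, a task can be thought of as a record like the one below; the field names are chosen for clarity and are not the benchmarks' actual schema:

task = {
    "prompt": 'def is_palindrome(s: str) -> bool:\n    """Return True if s reads the same forwards and backwards."""',
    "visible_tests": ["assert is_palindrome('aba') == True"],    # guide debugging
    "hidden_tests": ["assert is_palindrome('abca') == False"],   # used only for final evaluation
}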

2. Debugging Workflow

  • Profiling: Collect runtime data, segment execution into blocks, and inspect intermediate variable states.
  • Debugging Verdicts: The LLM evaluates whether each block aligns with the intended functionality, flags buggy blocks, and provides explanations for errors (illustrated in the sketch after this list).
  • Regeneration: Using the debugging feedback, LDB refines the program iteratively until it passes all visible tests or reaches the iteration limit.
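A verdict for one debugging round might be represented as per-block judgements like the following (an illustrative structure, not the exact format the LLM emits):

verdicts = [
    {"block": "B1_init", "correct": True,
     "explanation": "Accumulator initialized as the prompt specifies."},
    {"block": "B3_loop_body", "correct": False,
     "explanation": "Adds every element; the required parity check is missing."},
]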

3. Experimental Results

LDB has demonstrated significant improvements in debugging performance, achieving state-of-the-art accuracy on HumanEval, MBPP, and TransCoder benchmarks. For example:

  • LDB outperformed existing methods by up to 9.8%, showcasing its ability to address complex bugs.
  • Even when applied to advanced models like GPT-4, LDB enhanced code quality by leveraging runtime execution insights.

Code Explanation

The original code for the LLMDebugger project is available on GitHub.

1. Tokenizer Initialization

The tokenizer is initialized to count tokens processed during debugging. This helps in monitoring efficiency and API usage costs.

from transformers import GPT2Tokenizer 
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
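For example, token usage can be accumulated by encoding each prompt and response with this tokenizer (a usage sketch; the strings are placeholders, and token_nums mirrors the counter that is logged later):

prompt_text = "Fix the bug in the following function: ..."
response_text = "The accumulator starts at 1 instead of 0; here is the corrected code: ..."

token_nums = 0
token_nums += len(tokenizer.encode(prompt_text))
token_nums += len(tokenizer.encode(response_text))
print(token_nums)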

2. Main Debugging Function

Function: debug()

This function performs the step-by-step debugging process for a single programming task.

2.1 Initialization

The components necessary for debugging are initialized:

  • PyExecutor: Executes Python code and returns results.
  • PyGenerator: Handles LLM-based code generation and improvements.
  • model_factory: Creates the specified LLM interface.

exe = PyExecutor() 
gen = PyGenerator() 
model = model_factory(model_name, port)

2.2 Initial Implementation

The function generates the initial implementation of the code based on the dataset’s seed and prompt.

cur_func_impl = prepare_function_from_seed( 
    dataset_type, item["prompt"], item["seed"], item["entry_point"] 
) 
implementations.append(cur_func_impl)

2.3 Execution and Feedback

The generated implementation is executed against the test suite, and any failed test cases are recorded as feedback.

is_passing, failed_tests, _ = exe.execute(cur_func_impl, tests_i) 
test_feedback.append(failed_tests)

If the implementation passes all tests, the function exits early.

if is_passing: 
    is_solved = exe.evaluate(item["entry_point"], cur_func_impl, item["test"], timeout=10) 
    break

2.4 Debugging Iterations

If the initial attempt fails, the function enters an iterative debugging loop.

while cur_iter < max_iters:
    # Select a random failed test to drive this debugging round
    selected_test = failed_tests[random.randint(0, len(failed_tests) - 1)] if failed_tests else None

    # Step 1: profile the failing run and ask the LLM for block-level debugging verdicts
    messages = gen.ldb_debug(
        item["prompt"],
        debug_cur_func_impl,
        selected_test,
        item["entry_point"],
        model,
        messages,
        dataset_type,
        level,
    )

    # Step 2: regenerate the implementation using the debugging feedback
    cur_func_impl, cur_messages = gen.ldb_generate(
        func_sig=item["prompt"],
        model=model,
        prev_func_impl=cur_func_impl,
        messages=messages,
        failed_tests=selected_test,
        dataset_type=dataset_type,
    )

2.5 Refining Code

The debug process refines code using test feedback. This involves:

  • Adding comments to clarify the function’s purpose.
  • Adjusting logic based on failed test cases.

if dataset_type in ["HumanEval", "MBPP"]:
    # Add comments based on the prompt 
    if not find_comment(cur_func_impl, item["entry_point"]): 
        debug_cur_func_impl = insert_comment( 
            cur_func_impl, extrace_comment(item["prompt"]), item["entry_point"] 
        )

2.6 Evaluation and Exit

After each iteration, the refined implementation is re-executed. If all visible tests pass, the implementation is evaluated against the hidden test cases.

is_passing, failed_tests, _ = exe.execute(cur_func_impl, tests_i) 
if is_passing: 
    is_solved = exe.evaluate(item["entry_point"], cur_func_impl, item["test"], timeout=10) 
    if is_solved: 
        item["solution"] = cur_func_impl 
    break

2.7 Logging Results

The function logs debugging results, including:

  • Whether the task was solved.
  • Intermediate implementations and feedback.
  • Token usage during debugging.

item["is_passing"] = is_passing
item["is_solved"] = is_solved
item["implementations"] = implementations
item["test_feedback"] = test_feedback
item["solution"] = cur_func_impl
item["generated_test"] = tests_i
item["debug_iter"] = cur_iter
item["token_nums"] = token_nums

with FileLock(log_path + ".lock"):
    write_jsonl(log_path, [item], append=True)
print(f'completed {i+1}/{num_items}')

3. Parallel Debugging

Function: run_ldb()

This function orchestrates debugging across multiple tasks in the dataset.

3.1 Preparing Arguments

The dataset is processed, and arguments are prepared for debugging.

args = iter( 
    [ 
        (i, item, log_path, model_name, num_items, pass_at_k, max_iters, port, level) 
        for i, item in enumerate_resume(dataset, log_path, seedfile, testfile) 
    ] 
)

3.2 Running in Parallel

The debugging tasks are executed sequentially or in parallel, depending on the n_proc value.

if n_proc == 1: 
    for item in args: 
        debug(*item) 
else: 
    with Pool(n_proc) as pool: 
        pool.starmap(debug, args)

3.3 Calculating Accuracy

After all tasks are processed, the function calculates accuracy as the percentage of solved tasks.

print("Accuracy:", count_solved(log_path))

Example Flow of Debugging

Input:

  • A dataset of tasks, each with a prompt, a seed implementation, and test cases.

Execution:

  • An initial implementation is generated.
  • Debugging iterates until all visible tests pass or the maximum number of iterations is reached.

Output:

  • Logs containing the final solution, intermediate feedback, and debugging statistics (e.g., tokens used, iterations).