Repairing Programs with AI Agents

By Pekka Enberg — March 2026

AI coding agents are versatile, general-purpose tools for working on your code—they help you architect your systems, implement new features, and more. However, when you are fixing a bug in your system, the workflow is surprisingly manual. You receive a bug report or see a test failure. If you ask the AI coding agent just to fix the problem, you can easily end up wasting a lot of precious human time and tokens. The AI agent might hallucinate a fix that looks good on the outside, but is blatantly wrong.

Of course, you can ask the AI agent to write a reproducer for the issue first, but the test case might be totally useless, and if you forget to verify manually that it reproduces the same problem, the AI coding agent will fix the wrong thing. When repairing programs, we want to formalize this workflow and let AI agents handle it end-to-end. Not as a chat session where you paste error logs and ask for help, but as a structured, auditable pipeline: bug report in, verified fix out.

That is why I built rp. It is a command-line tool that takes a bug report, which can be a GitHub issue, a test failure, or just a description in plain English, and turns it into a fix through a three-step workflow: inspect, check, and fix, using a coding agent such as Claude Code, Codex, or OpenCode.

The Workflow

The key idea behind rp is that bug fixing follows a repeatable pattern, and you can automate each step independently.

Inspection. The rp inspect command takes a bug report and produces a reproducer. It invokes an AI agent to analyze the problem, write a summary, and generate the reproducer, which is a self-contained shell script that exits with a non-zero status while the bug exists. The reproducer is the key artifact. It turns a potentially ambiguous bug report into a concrete, deterministic test that a machine can evaluate.
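To make that contract concrete, here is a minimal hand-written skeleton of such a script. This is illustrative only: real reproducers are generated by the agent, and the command and strings below are placeholders, not anything rp produces.

```shell
#!/bin/sh
# Illustrative reproducer skeleton (placeholders, not an rp-generated script).
# Contract: exit 0 once the bug is fixed, non-zero while it persists.

# Placeholder for the real command that triggers the bug.
OUTPUT=$(echo "expected output")

if [ "$OUTPUT" = "expected output" ]; then
  echo "PASS: bug is fixed"      # fixed behavior observed; script exits 0
else
  echo "FAIL: bug still present" # buggy behavior observed
  exit 1
fi
```

Because the script is self-contained and judges itself by exit status alone, any harness (or human) can run it without knowing anything about the bug.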

Verification. The rp check command runs the reproducer against the local source tree. It tells you whether the bug actually reproduces in your environment. The verdict is one of three: reproduced (the bug is real), not reproduced (the bug is already fixed or doesn't apply), or broken reproducer (the reproducer script itself has issues).
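The exact exit-code convention rp uses internally is not spelled out here, but the classification step can be sketched as follows. The mapping of exit codes to verdicts below is my assumption for illustration, and the stand-in reproducer is a throwaway script written to /tmp.

```shell
#!/bin/sh
# Sketch of a check step: run a reproducer and classify its exit status.
# The exit-code-to-verdict mapping is an assumption, not rp's documented contract.

# Stand-in reproducer that reports the bug as present.
cat > /tmp/demo-reproducer.sh <<'EOF'
#!/bin/sh
echo "FAIL: bug still present"
exit 1
EOF

sh /tmp/demo-reproducer.sh > /tmp/check.stdout 2> /tmp/check.stderr
status=$?

case "$status" in
  0) verdict="not reproduced" ;;    # bug absent or already fixed
  1) verdict="reproduced" ;;        # bug is present
  *) verdict="broken reproducer" ;; # the script itself malfunctioned
esac
echo "verdict: $verdict"
```

Capturing stdout and stderr to files alongside the verdict is what makes the run auditable after the fact.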

Fixing. The rp fix command invokes an AI agent with the full context of the issue — the original report, the summary, the analysis, the reproducer, and your project's conventions — and asks it to fix the bug. After the agent makes its changes, rp re-runs the reproducer to verify the fix actually works.

After `rp` is done fixing, you have a reproducer and a bug fix in your source tree, which you can review, commit, and merge.

Example: Fixing a Test Failure in Turso

Let's walk through a real example. We have a copy of the SQLite TCL test suite in the Turso project for compatibility, and some of the tests are failing. Instead of debugging them manually, I asked rp to investigate:

$ rp inspect "Run ./testing/sqlite3/all.test and reproduce the first failure you see"
inspect: source Run ./testing/sqlite3/all.test and reproduce the first failure you see
inspect: issue directory .rp/issues/run-testing-sqlite3-all-test-and-reproduce-the-f
inspect: agent claude
inspect: generating reproducer
inspect: wrote .rp/issues/run-testing-sqlite3-all-test-and-reproduce-the-f/SOURCE.txt
inspect: wrote .rp/issues/run-testing-sqlite3-all-test-and-reproduce-the-f/SUMMARY.txt
inspect: wrote .rp/issues/run-testing-sqlite3-all-test-and-reproduce-the-f/inspect.md
inspect: wrote .rp/issues/run-testing-sqlite3-all-test-and-reproduce-the-f/reproducer.sh
inspect: wrote .rp/issues/run-testing-sqlite3-all-test-and-reproduce-the-f/status
inspect: done

The agent ran the test suite, identified the first failure, analyzed it, and produced a set of artifacts in the .rp/issues/ directory.

The inspect.md contains the agent's analysis of the failure:

# Bug Analysis: `select1-3.9` — Wrong error message for `count()` with incorrect argument count

## Failing test
- **Test ID**: `select1-3.9` (first failure in `testing/sqlite3/all.test`)
- **SQL**: `SELECT f1 FROM test1 WHERE count(f1,f2)!=11`
- **Expected error**: `wrong number of arguments to function count()`
- **Actual error**: `unknown function count`

## Root cause

The function resolution pipeline has two layers:

1. **`Func::resolve_function(name, arg_count)`** in `core/function.rs:1185-1193` — correctly
   identifies that `count` with 2 args is invalid and returns
   `Err("wrong number of arguments to function count()")`.

2. **`Emitter::resolve_function()`** in `core/translate/emitter/mod.rs:202-210` — calls
   `Func::resolve_function(...).ok()`, which **discards the error** and converts it to `None`.
   It then falls through to the external function lookup, which also returns `None`.

3. **Caller in `core/translate/expr.rs:1507-1508`** — sees `None` from the resolver and emits
   the generic `"unknown function {name}"` error.

The fix should propagate the specific error from `Func::resolve_function` rather than swallowing
it via `.ok()`. The emitter's `resolve_function` needs to distinguish between "function not found"
(try external functions) and "function found but wrong args" (return the specific error).

The agent traced the bug all the way down to a .ok() call that silently discards a specific error message. That is a precise root cause analysis. Now let's look at the reproducer:

#!/bin/sh
# Reproducer for select1-3.9 failure: count() with wrong number of arguments
# gives "unknown function count" instead of "wrong number of arguments to function count()"

set -e

cd "$(dirname "$0")"
# Find repo root (look for Cargo.toml)
while [ ! -f Cargo.toml ] && [ "$(pwd)" != "/" ]; do cd ..; done

# Build tursodb
cargo build --bin tursodb 2>/dev/null

# Run the failing query
OUTPUT=$(echo "CREATE TABLE test1(f1 int, f2 int); \
INSERT INTO test1 VALUES(11,22); \
INSERT INTO test1 VALUES(33,44); \
SELECT f1 FROM test1 WHERE count(f1,f2)!=11;" \
| cargo run -q --bin tursodb -- -q 2>&1 || true)

echo "Output: $OUTPUT"

# Check that the error message matches SQLite's expected message
case "$OUTPUT" in
  *"wrong number of arguments to function count()"*)
    echo "PASS: Correct error message"
    exit 0
    ;;
  *"unknown function count"*)
    echo "FAIL: Got generic 'unknown function' instead of specific argument count error"
    exit 1
    ;;
  *)
    echo "FAIL: Unexpected output"
    exit 1
    ;;
esac

Next I ran rp check to confirm the bug reproduces:

$ rp check
check: issue run-testing-sqlite3-all-test-and-reproduce-the-f
check: reproducer .rp/issues/run-testing-sqlite3-all-test-and-reproduce-the-f/reproducer.sh
check: running reproducer
check: verdict reproduced
check: exit code 1
check: wrote .rp/issues/run-testing-sqlite3-all-test-and-reproduce-the-f/check.status

The verdict is reproduced — the bug is real and the reproducer script reliably triggers it. I then asked rp to fix it:

$ rp fix
fix: issue run-testing-sqlite3-all-test-and-reproduce-the-f
fix: issue directory .rp/issues/run-testing-sqlite3-all-test-and-reproduce-the-f
fix: invoking agent

The agent now has everything it needs: the original description, the analysis, the reproducer, and the project's test conventions. It attempts a fix and verifies that the reproducer passes afterward. Here is the resulting commit:


core/translate: Propagate error from function resolution

The `Func::resolve_function()` function returns specific errors like "wrong
number of arguments to function count()" but resolve_function() in the emitter
was calling .ok() which discarded the error and converted it to None. Callers
then emitted a generic "unknown function" error instead.

Change Emitter::resolve_function() to return Result<Option<Func>> so that
specific errors (e.g. wrong arg count) propagate to the caller while "not
found" remains None.

Fixes select1-3.9 in testing/sqlite3/all.test.

The key change is in core/translate/emitter/mod.rs, where the agent replaced the .ok() call that was swallowing the error:

 // Before:
-pub fn resolve_function(&self, func_name: &str, arg_count: usize) -> Option<Func> {
-    match Func::resolve_function(func_name, arg_count).ok() {
-        Some(func) => Some(func),
-        None => self

 // After:
+pub fn resolve_function(
+    &self,
+    func_name: &str,
+    arg_count: usize,
+) -> Result<Option<Func>, LimboError> {
+    match Func::resolve_function(func_name, arg_count) {

The fix touches four files and changes 30 lines total — propagating the Result type through the call chain so that specific errors like "wrong number of arguments" reach the user instead of being replaced by a generic "unknown function" message.

How It Works

All state lives in a .rp/ directory at the root of your repository. Each issue gets its own subdirectory with the following structure:

.rp/
└── issues/
    └── run-testing-sqlite3-all-test-and-reproduce-the-f/
        ├── SOURCE.txt        # the original input
        ├── SUMMARY.txt       # condensed summary
        ├── inspect.md        # agent's analysis
        ├── reproducer.sh     # deterministic reproducer script
        ├── status            # current state (inspected, reproduced, fixed, ...)
        ├── check.stdout      # output from last check run
        ├── check.stderr      # errors from last check run
        └── check.status      # machine-readable verdict

You configure rp for your project with a .rp.yml manifest that tells it how to verify fixes and where tests live. For example, you might have:

version: 1

verify-cmd: make test

tests:
  - testing/conformance

guidance: |
  Prefer adding a minimal regression test under testing/conformance.
  Choose the most specific suite for the bug.

The verify-cmd is the command that must pass for a fix to be considered valid. The tests field lists directories where the agent is expected to add regression tests. And guidance lets you encode your project's conventions in plain English — the agent reads it as part of its context.
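For a Rust project like the Turso example above, a manifest might look like this. The values are hypothetical; only the fields shown earlier (version, verify-cmd, tests, and guidance) are used.

```yaml
version: 1

verify-cmd: cargo test

tests:
  - testing/sqlite3

guidance: |
  Error messages must match SQLite's text exactly; the TCL suite
  compares strings, not error codes.
```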

Under the hood, rp delegates to AI coding agents. It currently supports Claude, Codex, and OpenCode as backends. The agent abstraction is intentionally thin: rp handles the workflow, the artifact management, and the verification; the agent handles the reasoning and code changes.

What's Next

The tool is still early and in active development. But the direction seems promising: bug fixing should be a pipeline, not a conversation. A structured workflow with deterministic reproducers, machine-checkable verdicts, and auditable artifacts is something we can automate, verify, and trust.

There are various things I still want to do to improve the tool.

The code is available at github.com/penberg/rp.