Benchmarking AI Agents on Property-Based Bug Discovery
PBT-Bench evaluates whether AI agents can discover hidden semantic bugs in real Python libraries using property-based testing with Hypothesis — guided solely by official API documentation, without any knowledge of the bugs.
A different kind of benchmark
Most code benchmarks ask an AI to fix a known bug or to make visible failing tests pass. PBT-Bench asks a harder question: can an agent discover bugs that no one told it about?
Bugs Invisible to Inspection
Every injected bug survives a 10-minute code review and passes all existing unit tests. Discovery requires systematic semantic reasoning, not grepping the source.
Documentation as the Sole Oracle
Agents receive official API documentation only. Bugs hide in the gap between what the docs promise and what the implementation delivers. No bug descriptions, no hints.
Automated F→P Evaluation
The Fail-to-Pass criterion requires zero human judgment: a test must FAIL on the buggy library and PASS on the fixed version. Per-bug, per-function scoring. Fully reproducible.
Hidden Ground Truth
Reference property tests are never shown to agents. Any independent test that achieves F→P for a specific bug counts as a discovery — no pattern matching to a reference solution.
How It Works
Bug Injection
Semantic bugs are injected into real Python libraries via unified diff patches, at least two call-chain layers below the public API surface. Each bug violates a documented invariant, passes all existing unit tests, and has a precisely characterized trigger region — making it discoverable only through informed property testing.
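For flavor, a hypothetical patch of the shape described (library, path, and helper all invented): the public API documents snapping to the nearest gridline, but the internal helper two call layers down now truncates toward zero. Unit tests built on exact multiples of step keep passing; only values past a midpoint misbehave, giving the bug a precisely characterized trigger region.

```diff
--- a/mylib/_internal/quantize.py
+++ b/mylib/_internal/quantize.py
@@ -12,3 +12,3 @@ def _snap(value, step):
     if step <= 0:
         raise ValueError("step must be positive")
-    return step * round(value / step)
+    return step * int(value / step)
```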
Documentation as Oracle
The agent receives official API documentation as its only oracle. Its task is to infer semantic invariants from the docs — contracts the library promises to uphold — and design Hypothesis input strategies that concentrate probability mass on the trigger region. Source code is accessible but insufficient for localization.
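As a small sketch of that inference step, using a stdlib function as a stand-in target: the os.path.normpath documentation describes collapsing a path into a normalized form, which implies the contract that normalizing twice equals normalizing once.

```python
from hypothesis import given, strategies as st
import os.path

# Invariant inferred from the docs: normalization is idempotent.
@given(st.text(alphabet="ab/.", min_size=1))
def test_normpath_idempotent(p):
    once = os.path.normpath(p)
    assert os.path.normpath(once) == once
```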
Property Test Design
The agent writes Hypothesis @given property tests checking semantic contracts — roundtrip consistency, commutativity, idempotency, algebraic laws, cross-API protocol invariants, and more. The key challenge is crafting focused input strategies from documentation rather than defaulting to unconstrained random search.
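A minimal sketch of what a focused strategy looks like, with an invented constraint: suppose the documentation says the chunk size must be a positive multiple of 8 up to 4096. The strategy below samples exactly that region instead of all integers, then checks a roundtrip contract.

```python
from hypothesis import given, strategies as st

# Doc-derived constraint (invented for illustration):
# "chunk_size must be a positive multiple of 8, at most 4096".
chunk_sizes = st.integers(min_value=1, max_value=512).map(lambda k: 8 * k)

@given(data=st.binary(max_size=4096), chunk_size=chunk_sizes)
def test_chunked_roundtrip(data, chunk_size):
    # Roundtrip property: splitting into chunks and re-joining restores the input.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    assert b"".join(chunks) == data
```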
F→P Evaluation
Each test function is evaluated independently against each (buggy_lib, fixed_lib) pair. A bug is found when at least one test FAILs on the buggy version and PASSes on the version with only that bug reversed. Bug-level recall and problem-level coverage are both reported.
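In pseudocode terms the criterion reduces to two pytest runs per test function. The sketch below assumes one interpreter per library version, which is illustrative rather than PBT-Bench's actual harness layout.

```python
import subprocess

def fail_to_pass(test_file: str, buggy_python: str, fixed_python: str) -> bool:
    """True iff the test FAILs on the buggy install and PASSes on the fixed one."""
    def passes(python: str) -> bool:
        result = subprocess.run([python, "-m", "pytest", test_file, "-q"],
                                capture_output=True)
        return result.returncode == 0
    return (not passes(buggy_python)) and passes(fixed_python)
```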
Two Evaluation Tracks
Unconstrained Testing
The agent may use any testing approach. No framework is prescribed — unit tests, ad-hoc scripts, or property tests all count. This measures what's achievable with current best practices and serves as the comparison baseline.
Property-Based Testing
The agent must use Hypothesis @given with custom input strategies and a compact property taxonomy (Roundtrip, Model-oracle, Invariant, Boundary). The scaffold eliminates framework-choice overhead and focuses effort on strategy design.
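For orientation, minimal skeletons of the four taxonomy classes, each written against a stdlib stand-in rather than a benchmark problem:

```python
from hypothesis import given, strategies as st

@given(st.text())
def test_roundtrip(s):
    # Roundtrip: decoding an encoding restores the input.
    assert s.encode("utf-8").decode("utf-8") == s

@given(st.lists(st.integers(), min_size=1))
def test_model_oracle(xs):
    # Model-oracle: compare the target against a simple reference computation.
    assert max(xs) == sorted(xs)[-1]

@given(st.lists(st.integers()))
def test_invariant(xs):
    # Invariant: sorted output is non-decreasing.
    s = sorted(xs)
    assert all(a <= b for a, b in zip(s, s[1:]))

@given(st.integers(min_value=-3, max_value=3))
def test_boundary(n):
    # Boundary: probe the edges of the documented domain (empty ranges here).
    assert len(range(n)) == max(n, 0)
```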
Results
4,800 agent trajectories across 8 models × 2 prompt modes × 3 independent runs. Bug-level recall averaged over three runs on all 100 problems.
| Model | PBT Recall | Baseline Recall | Δ (pp) | Perfect Rate (PBT) |
|---|---|---|---|---|
| Sonnet 4.6 | 83.4% | 76.7% | +6.7 | 67.0% |
| Qwen 3.6 Plus | 78.0% | 53.5% | +24.5 | 56.0% |
| GLM 5.1 | 72.1% | 66.6% | +5.5 | 63.0% |
| Gemini 3 Flash | 62.8% | 56.2% | +6.6 | 37.0% |
| DeepSeek V3.2 | 61.0% | 64.2% | −3.2 | 38.7% |
| Step 3.5 Flash | 58.1% | 37.8% | +20.3 | 41.0% |
| Qwen 3.5-30B-A3B | 54.3% | 31.4% | +22.9 | 25.7% |
| Grok 4.1 Fast | 42.1% | 50.1% | −8.0 | 21.0% |
PBT scaffolding helps mid-capability models most
Recall gains (Δ = PBT − Baseline) range from +24.5 pp (Qwen 3.6 Plus) and +22.9 pp (Qwen 3.5-30B-A3B) down to −8.0 pp (Grok 4.1 Fast). For the four models with the largest gaps, |Δ| exceeds the sum of their 95% bootstrap CIs, so those signs are statistically robust. The pattern is consistent with the PBT prompt substituting for a capability weaker models lack — not complementing a skill stronger models already carry.
Coverage and full recall diverge; the benchmark is not saturated
The top cell (Sonnet 4.6 Baseline) achieves 98.0% problem coverage but only 43.0% perfect recall — it frequently finds one bug but misses co-located bugs. Switching to PBT trades coverage (92.7%) for a large gain in perfect recall (67.0%), consistent with its instruction to write multiple focused tests per bug. No cell exceeds 70% perfect rate.
Ensemble upper bound: 99.5% with 16 cells
The best single cell (Sonnet 4.6 PBT) reliably covers 317 of 365 bugs (86.8%). Adding Sonnet 4.6 Baseline lifts union coverage to 95.1%; the full 16-cell ensemble reaches 99.5%. Within-model mode complementarity (PBT + Baseline, same architecture) accounts for 65% of ensemble headroom — suggesting the two modes elicit categorically different testing strategies even within a single model. Two L3 bugs remain out of reach of any cell within the 200-example budget.
L1–L3 labels track empirical hardness
Averaged across all eight models, mean recall descends monotonically: L1 66% → L2 60% → L3 51%. This ordering holds in 15 of 16 cells. For weaker models, the PBT benefit itself decreases with difficulty: Qwen 3.5-30B-A3B gains +35.2 / +25.4 / +14.8 pp on L1/L2/L3, suggesting the scaffold substitutes for constraint-reasoning on easy bugs but cannot close the gap on L3 protocol violations.
Failure modes are categorically different between tracks
Across 20 sampled failures per cell, Baseline failures concentrate on Incorrect Assertion (59%) and Overly Concrete Test (32%) — correct-looking tests aimed at the wrong invariant. PBT failures shift to search pathologies: assume() misuse (31%), where the agent excludes the bug's trigger region via over-filtering, and Wrong Strategy Range (31%), where the input distribution misses the narrow trigger. Both failure modes are prompt-actionable without architectural changes.
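The assume() pathology is easy to reproduce. In the self-contained sketch below (codec and bug both invented), the guard discards exactly the short strings where the planted bug lives, so the property passes on the buggy code and can never achieve F→P:

```python
from hypothesis import assume, given, strategies as st

def encode(s: str) -> str:
    return s[::-1] if len(s) > 2 else s   # planted bug: short strings not reversed

def decode(s: str) -> str:
    return s[::-1]

@given(st.text())
def test_roundtrip_overfiltered(s):
    assume(len(s) > 10)                   # over-filtering: excludes the trigger region
    assert decode(encode(s)) == s         # passes despite the bug above
```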
Problem Set
100 problems across 40 real Python libraries in 7 domains. Each problem contains one or more injected semantic bugs (365 total, mean 3.65 per problem) organized into three difficulty levels.
- Date & Time
- Data Structures
- Serialization & Encoding
- Type Systems & Schemas
- Numerics & Math
- Graph & State Machines
- Parsing & Documents
Difficulty Levels
| Level | Count | Description | Trigger requirement |
|---|---|---|---|
| L1 | 87 | Single-constraint bugs. Default Hypothesis strategies suffice or a single doc-derived constraint is needed. | One input range, size class, or structural property |
| L2 | 184 | Multi-constraint triggers requiring simultaneous satisfaction of several documentation-derived conditions. | Specific value range and structural constraints and operation ordering |
| L3 | 94 | Cross-function or cross-operation protocol violations requiring stateful or metamorphic tests. | Invariant spans multiple API calls; requires sequences or metamorphic checks |
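For a feel of what L3 demands, a stateful sketch in Hypothesis's RuleBasedStateMachine style; the cache class is a toy stand-in so the example runs on its own, not a benchmark target:

```python
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

class ToyCache:
    """Toy 'library under test': an invented, capacity-bounded map."""
    def __init__(self, capacity: int = 4):
        self.capacity, self._d = capacity, {}

    def put(self, k, v):
        if k not in self._d and len(self._d) >= self.capacity:
            self._d.pop(next(iter(self._d)))  # evict the oldest entry
        self._d[k] = v

    def get(self, k):
        return self._d.get(k)

class CacheProtocol(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.cache, self.model = ToyCache(), {}

    @rule(k=st.integers(0, 7), v=st.integers())
    def put(self, k, v):
        self.cache.put(k, v)
        self.model[k] = v

    @invariant()
    def reads_agree_with_model(self):
        # Cross-call invariant: a key the cache still holds must return the
        # most recently written value, never a stale one.
        for k, v in self.model.items():
            got = self.cache.get(k)
            assert got is None or got == v

TestCacheProtocol = CacheProtocol.TestCase
```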
Get Started
Requirements
- Python 3.12
- uv package manager
- Docker (for isolated evaluation containers)
- An LLM API key — Anthropic, OpenRouter, or compatible
1 — Install
```bash
cd pbt-bench

# Create virtual environment and install dependencies
uv venv .venv --python 3.12
uv pip install -r requirements.txt
uv pip install \
    vendor/software-agent-sdk/openhands-sdk \
    vendor/software-agent-sdk/openhands-tools \
    vendor/software-agent-sdk/openhands-workspace \
    vendor/software-agent-sdk/openhands-agent-server
```
2 — Configure your LLM
cp llm_configs/llm_config_example.json eval/llm_config.json
# then edit eval/llm_config.json and fill in your API key
Supported: Anthropic, OpenRouter, or any OpenAI-compatible endpoint.
See llm_configs/ for format examples.
3 — Run
```bash
# PBT mode — first 5 problems, 40 agent iterations each
.venv/bin/python3 eval/run_pbt.py eval/llm_config.json \
    --n-limit 5 --max-iterations 40 --note my_run

# Baseline mode
.venv/bin/python3 eval/run_baseline.py eval/llm_config.json \
    --max-iterations 50 --note my_run

# Or run both tracks via the shell script
bash scripts/run_eval.sh
```
Output
Results stream to experiments/eval_outputs/<mode>/<model>_<note>/<timestamp>/ as JSONL — crash-safe and resumable. Each run directory contains output.jsonl (per-problem results), summary.json (aggregate metrics), and _workspaces/<ID>/pbt_test.py (the agent's test file). Use eval/display.py to view live progress during a run.
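To poke at a finished run without the live viewer, something like the following works; the path mirrors the layout above, and nothing here assumes specific record fields:

```python
import json
from pathlib import Path

# Fill in your own run directory.
run_dir = Path("experiments/eval_outputs/pbt/<model>_my_run/<timestamp>")
for line in (run_dir / "output.jsonl").open():
    print(json.loads(line))   # one JSON object per evaluated problem
```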