Benchmarking AI Agents on Property-Based Bug Discovery
PBT-Bench evaluates whether AI agents can discover hidden semantic bugs in real Python libraries using property-based testing with Hypothesis — guided solely by official API documentation, without any knowledge of the bugs.
A different kind of benchmark
Most code benchmarks ask an AI to fix a known bug or to make visible failing tests pass. PBT-Bench asks a harder question: can an agent discover bugs that no one told it about?
Bugs Invisible to Inspection
Every injected bug survives a 10-minute code review and passes all existing unit tests. Discovery requires systematic semantic reasoning, not grepping the source.
Documentation as the Sole Oracle
Agents receive official API documentation only. Bugs hide in the gap between what the docs promise and what the implementation delivers. No bug descriptions, no hints.
Automated F→P Evaluation
The Fail-to-Pass criterion requires zero human judgment: a test must FAIL on the buggy library and PASS on the fixed version. Per-bug, per-function scoring. Fully reproducible.
Hidden Ground Truth
Reference property tests are never shown to agents. Any independent test that achieves F→P for a specific bug counts as a discovery — no pattern matching to a reference solution.
How It Works
Bug Injection
Semantic bugs are injected into real Python libraries via unified diff patches, at least two call-chain layers below the public API surface. Each bug violates a documented invariant, passes all existing unit tests, and has a precisely characterized trigger region — making it discoverable only through informed property testing.
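For flavor, a hypothetical patch of the shape described (library, path, and helper all invented): the public API documents snapping to the nearest gridline, but the internal helper two call layers down now truncates toward zero. Unit tests built on exact multiples of step keep passing; only values past a midpoint misbehave, giving the bug a precisely characterized trigger region.

```diff
--- a/mylib/_internal/quantize.py
+++ b/mylib/_internal/quantize.py
@@ -12,3 +12,3 @@ def _snap(value, step):
     if step <= 0:
         raise ValueError("step must be positive")
-    return step * round(value / step)
+    return step * int(value / step)
```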
Documentation as Oracle
The agent receives official API documentation as its only oracle. Its task is to infer semantic invariants from the docs — contracts the library promises to uphold — and design Hypothesis input strategies that concentrate probability mass on the trigger region. Source code is accessible but insufficient for localization.
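As a small sketch of that inference step, using a stdlib function as a stand-in target: the os.path.normpath documentation describes collapsing a path into a normalized form, which implies the contract that normalizing twice equals normalizing once.

```python
from hypothesis import given, strategies as st
import os.path

# Invariant inferred from the docs: normalization is idempotent.
@given(st.text(alphabet="ab/.", min_size=1))
def test_normpath_idempotent(p):
    once = os.path.normpath(p)
    assert os.path.normpath(once) == once
```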
Property Test Design
The agent writes Hypothesis @given property tests checking semantic contracts — roundtrip consistency, commutativity, idempotency, algebraic laws, cross-API protocol invariants, and more. The key challenge is crafting focused input strategies from documentation rather than defaulting to unconstrained random search.
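A minimal sketch of what a focused strategy looks like, with an invented constraint: suppose the documentation says the chunk size must be a positive multiple of 8 up to 4096. The strategy below samples exactly that region instead of all integers, then checks a roundtrip contract.

```python
from hypothesis import given, strategies as st

# Doc-derived constraint (invented for illustration):
# "chunk_size must be a positive multiple of 8, at most 4096".
chunk_sizes = st.integers(min_value=1, max_value=512).map(lambda k: 8 * k)

@given(data=st.binary(max_size=4096), chunk_size=chunk_sizes)
def test_chunked_roundtrip(data, chunk_size):
    # Roundtrip property: splitting into chunks and re-joining restores the input.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    assert b"".join(chunks) == data
```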
F→P Evaluation
Each test function is evaluated independently against each (buggy_lib, fixed_lib) pair. A bug is found when at least one test FAILs on the buggy version and PASSes on the version with only that bug reversed. Bug-level recall and problem-level coverage are both reported.
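In pseudocode terms the criterion reduces to two pytest runs per test function. The sketch below assumes one interpreter per library version, which is illustrative rather than PBT-Bench's actual harness layout.

```python
import subprocess

def fail_to_pass(test_file: str, buggy_python: str, fixed_python: str) -> bool:
    """True iff the test FAILs on the buggy install and PASSes on the fixed one."""
    def passes(python: str) -> bool:
        result = subprocess.run([python, "-m", "pytest", test_file, "-q"],
                                capture_output=True)
        return result.returncode == 0
    return (not passes(buggy_python)) and passes(fixed_python)
```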
Two Evaluation Tracks
Unconstrained Testing
The agent may use any testing approach. No framework is prescribed — unit tests, ad-hoc scripts, or property tests all count. This measures what's achievable with current best practices and serves as the comparison baseline.
Property-Based Testing
The agent must use Hypothesis @given with custom input strategies and a compact property taxonomy (Roundtrip, Model-oracle, Invariant, Boundary). The scaffold eliminates framework-choice overhead and focuses effort on strategy design.
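For orientation, minimal skeletons of the four taxonomy classes, each written against a stdlib stand-in rather than a benchmark problem:

```python
from hypothesis import given, strategies as st

@given(st.text())
def test_roundtrip(s):
    # Roundtrip: decoding an encoding restores the input.
    assert s.encode("utf-8").decode("utf-8") == s

@given(st.lists(st.integers(), min_size=1))
def test_model_oracle(xs):
    # Model-oracle: compare the target against a simple reference computation.
    assert max(xs) == sorted(xs)[-1]

@given(st.lists(st.integers()))
def test_invariant(xs):
    # Invariant: sorted output is non-decreasing.
    s = sorted(xs)
    assert all(a <= b for a, b in zip(s, s[1:]))

@given(st.integers(min_value=-3, max_value=3))
def test_boundary(n):
    # Boundary: probe the edges of the documented domain (empty ranges here).
    assert len(range(n)) == max(n, 0)
```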
Results
4,800 agent trajectories across 8 models × 2 prompt modes × 3 independent runs. Bug-level recall averaged over three runs on all 100 problems.
| Model | PBT Recall | Baseline Recall | Δ (pp) | Perfect Rate (PBT) |
|---|---|---|---|---|
| Sonnet 4.6 | 83.4% | 76.7% | +6.7 | 67.0% |
| Qwen 3.6 Plus | 78.0% | 53.5% | +24.5 | 56.0% |
| GLM 5.1 | 72.1% | 66.6% | +5.5 | 63.0% |
| Gemini 3 Flash | 62.8% | 56.2% | +6.6 | 37.0% |
| DeepSeek V3.2 | 61.0% | 64.2% | −3.2 | 38.7% |
| Step 3.5 Flash | 58.1% | 37.8% | +20.3 | 41.0% |
| Qwen 3.5-30B-A3B | 54.3% | 31.4% | +22.9 | 25.7% |
| Grok 4.1 Fast | 42.1% | 50.1% | −8.0 | 21.0% |
PBT scaffolding helps mid-capability models most
Recall gains (Δ = PBT − Baseline) range from +24.5 pp (Qwen 3.6 Plus) and +22.9 pp (Qwen 3.5-30B-A3B) down to −8.0 pp (Grok 4.1 Fast). For the four models with the largest gaps, |Δ| exceeds the sum of their 95% bootstrap CIs, so those signs are statistically robust. The pattern is consistent with the PBT prompt substituting for a capability weaker models lack — not complementing a skill stronger models already carry.
Coverage and full recall diverge; the benchmark is not saturated
The top cell (Sonnet 4.6 Baseline) achieves 98.0% problem coverage but only 43.0% perfect recall — it frequently finds one bug but misses co-located bugs. Switching to PBT trades coverage (92.7%) for a large gain in perfect recall (67.0%), consistent with its instruction to write multiple focused tests per bug. No cell exceeds 70% perfect rate.
Ensemble upper bound: 99.5% with 16 cells
The best single cell (Sonnet 4.6 PBT) reliably covers 317 of 365 bugs (86.8%). Adding Sonnet 4.6 Baseline lifts union coverage to 95.1%; the full 16-cell ensemble reaches 99.5%. Within-model mode complementarity (PBT + Baseline, same architecture) accounts for 65% of ensemble headroom — suggesting the two modes elicit categorically different testing strategies even within a single model. Two L3 bugs remain out of reach of any cell within the 200-example budget.
L1–L3 labels track empirical hardness
Averaged across all eight models, mean recall descends monotonically: L1 66% → L2 60% → L3 51%. This ordering holds in 15 of 16 cells. For weaker models, the PBT benefit itself decreases with difficulty: Qwen 3.5-30B-A3B gains +35.2 / +25.4 / +14.8 pp on L1/L2/L3, suggesting the scaffold substitutes for constraint-reasoning on easy bugs but cannot close the gap on L3 protocol violations.
Failure modes are categorically different between tracks
Across 20 sampled failures per cell, Baseline failures concentrate on Incorrect Assertion (59%) and Overly Concrete Test (32%) — correct-looking tests aimed at the wrong invariant. PBT failures shift to search pathologies: assume() misuse (31%), where the agent excludes the bug's trigger region via over-filtering, and Wrong Strategy Range (31%), where the input distribution misses the narrow trigger. Both failure modes are prompt-actionable without architectural changes.
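The assume() pathology is easy to reproduce. In the self-contained sketch below (codec and bug both invented), the guard discards exactly the short strings where the planted bug lives, so the property passes on the buggy code and can never achieve F→P:

```python
from hypothesis import assume, given, strategies as st

def encode(s: str) -> str:
    return s[::-1] if len(s) > 2 else s   # planted bug: short strings not reversed

def decode(s: str) -> str:
    return s[::-1]

@given(st.text())
def test_roundtrip_overfiltered(s):
    assume(len(s) > 10)                   # over-filtering: excludes the trigger region
    assert decode(encode(s)) == s         # passes despite the bug above
```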
Problem Set
100 problems across 40 real Python libraries in 7 domains. Each problem contains one or more injected semantic bugs (365 total, mean 3.65 per problem) organized into three difficulty levels.
- Date & Time
- Data Structures
- Serialization & Encoding
- Type Systems & Schemas
- Numerics & Math
- Graph & State Machines
- Parsing & Documents
Difficulty Levels
| Level | Count | Description | Trigger requirement |
|---|---|---|---|
| L1 | 87 | Single-constraint bugs. Default Hypothesis strategies suffice or a single doc-derived constraint is needed. | One input range, size class, or structural property |
| L2 | 184 | Multi-constraint triggers requiring simultaneous satisfaction of several documentation-derived conditions. | Specific value range and structural constraints and operation ordering |
| L3 | 94 | Cross-function or cross-operation protocol violations requiring stateful or metamorphic tests. | Invariant spans multiple API calls; requires sequences or metamorphic checks |
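For a feel of what L3 demands, a stateful sketch in Hypothesis's RuleBasedStateMachine style; the cache class is a toy stand-in so the example runs on its own, not a benchmark target:

```python
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

class ToyCache:
    """Toy 'library under test': an invented, capacity-bounded map."""
    def __init__(self, capacity: int = 4):
        self.capacity, self._d = capacity, {}

    def put(self, k, v):
        if k not in self._d and len(self._d) >= self.capacity:
            self._d.pop(next(iter(self._d)))  # evict the oldest entry
        self._d[k] = v

    def get(self, k):
        return self._d.get(k)

class CacheProtocol(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.cache, self.model = ToyCache(), {}

    @rule(k=st.integers(0, 7), v=st.integers())
    def put(self, k, v):
        self.cache.put(k, v)
        self.model[k] = v

    @invariant()
    def reads_agree_with_model(self):
        # Cross-call invariant: a key the cache still holds must return the
        # most recently written value, never a stale one.
        for k, v in self.model.items():
            got = self.cache.get(k)
            assert got is None or got == v

TestCacheProtocol = CacheProtocol.TestCase
```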
Get Started
Requirements
- Python 3.12
- uv package manager
- Docker (for isolated evaluation containers)
- An LLM API key — Anthropic, OpenRouter, or compatible
1 — Install
```bash
cd pbt-bench

# Create virtual environment and install dependencies
uv venv .venv --python 3.12
uv pip install -r requirements.txt
uv pip install \
    vendor/software-agent-sdk/openhands-sdk \
    vendor/software-agent-sdk/openhands-tools \
    vendor/software-agent-sdk/openhands-workspace \
    vendor/software-agent-sdk/openhands-agent-server
```
2 — Configure your LLM
cp llm_configs/llm_config_example.json eval/llm_config.json
# then edit eval/llm_config.json and fill in your API key
Supported: Anthropic, OpenRouter, or any OpenAI-compatible endpoint.
See llm_configs/ for format examples.
3 — Run
```bash
# PBT mode — first 5 problems, 40 agent iterations each
.venv/bin/python3 eval/run_pbt.py eval/llm_config.json \
    --n-limit 5 --max-iterations 40 --note my_run

# Baseline mode
.venv/bin/python3 eval/run_baseline.py eval/llm_config.json \
    --max-iterations 50 --note my_run

# Or run both tracks via the shell script
bash scripts/run_eval.sh
```
Output
Results stream to experiments/eval_outputs/<mode>/<model>_<note>/<timestamp>/ as JSONL — crash-safe and resumable. Each run directory contains output.jsonl (per-problem results), summary.json (aggregate metrics), and _workspaces/<ID>/pbt_test.py (the agent's test file). Use eval/display.py to view live progress during a run.
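To poke at a finished run without the live viewer, something like the following works; the path mirrors the layout above, and nothing here assumes specific record fields:

```python
import json
from pathlib import Path

# Fill in your own run directory.
run_dir = Path("experiments/eval_outputs/pbt/<model>_my_run/<timestamp>")
for line in (run_dir / "output.jsonl").open():
    print(json.loads(line))   # one JSON object per evaluated problem
```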