Work in Progress

Benchmark AI Agents on
Property-Based Bug Discovery

PBT-Bench evaluates whether AI agents can discover hidden semantic bugs in Python libraries using property-based testing with Hypothesis — guided solely by official API documentation.

61 Problems
28 Libraries
230+ Injected Bugs
L1–L4 Difficulty Levels

A different kind of benchmark

Most code benchmarks ask an AI to fix a known bug or make visible failing tests pass. PBT-Bench asks a harder question: can an agent discover bugs that no one told it about?

🔍

Bugs Invisible to Inspection

Every injected bug survives a 10-minute code review and passes all existing unit tests. Discovery requires systematic semantic reasoning, not grepping the source.

📚

Documentation-Driven

Agents receive official API documentation as their primary oracle. Bugs hide in the gap between what the docs promise and what the implementation delivers.

Automated F→P Evaluation

The Fail-to-Pass criterion requires zero human judgment: a test must FAIL on the buggy library and PASS on the fixed version. Fully reproducible by anyone.

🔒

Hidden Ground Truth

Reference property tests are never shown to agents. Any test function that independently achieves F→P for a bug counts as a discovery.

How It Works

01

Bug Injection

Semantic bugs are injected into real Python libraries via unified diff patches. The bugs are designed to be undetectable through code inspection and invisible to all existing unit tests — they only surface through systematic property testing.
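To make this concrete, here is a hypothetical example of the kind of semantic bug such a patch might introduce (the function and bug are illustrative, not taken from the benchmark): the change is a one-character boundary shift that every typical unit test still passes.

```python
# Hypothetical semantic bug injection (illustrative, not from the benchmark).
# The buggy version differs only on inputs that existing unit tests never probe.

def clamp_fixed(value, low, high):
    """Clamp value into the inclusive range [low, high]."""
    return max(low, min(value, high))

def clamp_buggy(value, low, high):
    """Injected bug: the upper bound is silently treated as exclusive."""
    return max(low, min(value, high - 1))

# A typical existing unit test passes on BOTH versions:
assert clamp_fixed(5, 0, 10) == clamp_buggy(5, 0, 10) == 5

# The divergence only surfaces at the boundary the test suite never exercises:
print(clamp_fixed(10, 0, 10), clamp_buggy(10, 0, 10))  # 10 vs 9
```

A property test that checks the documented contract ("the result lies in [low, high] and equals value when value is already in range") would catch this immediately, while spot-check unit tests do not.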

02

Documentation as Oracle

The agent receives official API documentation for the target library. Its task is to infer semantic invariants from the docs: contracts the library promises to uphold. Source code is accessible but intentionally insufficient.

03

Property Test Design

The agent writes Hypothesis @given property tests checking semantic contracts — roundtrip consistency, commutativity, idempotency, algebraic laws, and more. Crafting focused input strategies from documentation is the critical challenge.
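A minimal sketch of the kind of property test an agent might write, using Python's standard json module as a stand-in target library (the strategy and property shown here are illustrative, not part of the benchmark's hidden reference tests):

```python
# Sketch: a roundtrip property test with Hypothesis, using json as a toy target.
import json
from hypothesis import given, strategies as st

# Strategy for JSON-representable values: scalars, plus nested lists/dicts.
json_values = st.recursive(
    st.none() | st.booleans() | st.integers() | st.text(),
    lambda children: st.lists(children) | st.dictionaries(st.text(), children),
    max_leaves=10,
)

@given(json_values)
def test_json_roundtrip(value):
    # Documented contract: loads inverts dumps for representable values.
    assert json.loads(json.dumps(value)) == value

test_json_roundtrip()  # Hypothesis runs the body over many generated inputs
```

Floats are deliberately excluded from the strategy so the roundtrip property holds exactly; designing strategies around such edge cases is precisely the skill the benchmark measures.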

04

F→P Evaluation

Each test function is evaluated independently against each (buggy_lib, fixed_lib) pair. A test "finds" a bug when it FAILs on the buggy version and PASSes on the fixed version. Bug recall is computed per problem.
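The criterion can be sketched in a few lines (function and library names here are illustrative; the real harness runs each test file in isolated environments):

```python
# Sketch of the F→P criterion (illustrative; actual harness details differ).

def run_test(test_fn, lib):
    """Return True if test_fn passes when run against the given library."""
    try:
        test_fn(lib)
        return True
    except AssertionError:
        return False

def finds_bug(test_fn, buggy_lib, fixed_lib):
    """F→P: the test must FAIL on the buggy library and PASS on the fixed one."""
    return not run_test(test_fn, buggy_lib) and run_test(test_fn, fixed_lib)

# Toy libraries: an abs() implementation with an injected boundary bug at -1.
buggy = {"abs": lambda x: x if x > 0 else -x if x < -1 else x}
fixed = {"abs": lambda x: x if x >= 0 else -x}

def prop_abs_nonnegative(lib):
    for x in (-2, -1, 0, 1):
        assert lib["abs"](x) >= 0

print(finds_bug(prop_abs_nonnegative, buggy, fixed))  # True
```

Note that a trivially failing test (one that fails on both versions) scores nothing: the PASS-on-fixed half of the criterion filters out tests that assert false properties.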

Two Evaluation Tracks

Baseline

Unconstrained Testing

The agent may use any approach: unit tests, code inspection, manual exploration, or arbitrary test generation. This measures what's achievable with current best practices — no Hypothesis required.

PBT

Property-Based Testing

The agent must use the Hypothesis @given decorator with custom input strategies. The key skill is deriving precise strategies from the API documentation rather than defaulting to random inputs.
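As a sketch of what "documentation-derived" means here (the target function and its documented contract are illustrative): if the docs state that a parameter must be an ISO-8601 date string, a focused strategy generates exactly that shape instead of arbitrary text.

```python
# Sketch: a strategy derived from a documented input format, rather than
# a default text strategy that would almost never produce a valid date.
from datetime import date
from hypothesis import given, strategies as st

# Documentation-derived strategy: valid ISO-8601 "YYYY-MM-DD" strings.
iso_dates = st.dates().map(lambda d: d.isoformat())

@given(iso_dates)
def test_roundtrip_parse(s):
    # Documented contract: fromisoformat inverts isoformat.
    assert date.fromisoformat(s).isoformat() == s

test_roundtrip_parse()
```

With a default st.text() strategy, nearly every generated input would be rejected as malformed before reaching the code path where a semantic bug hides; the derived strategy concentrates all test effort on valid inputs.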

Problem Set

The current evaluated set spans 28 Python libraries across diverse domains. Each problem contains injected semantic bugs organized into four difficulty levels. Problems are evaluated with Claude Sonnet 4.6 and GLM-5 on both the Baseline and PBT tracks.

Date & Time

arrow · pendulum · dateutil · icalendar

Math & Scientific

mpmath · sympy · galois · pint

Type Systems

attrs · cattrs · marshmallow

Parsing & Documents

parso · pyparsing · pypdf · html5lib

Data Structures

sortedcontainers · bidict · portion · cachetools · toolz

Config & Graphs

tomlkit · networkx · transitions · boltons · openpyxl

Encoding & Algorithms

msgpack · pyasn1 · codeforces

Difficulty Levels

Level | Description                                                  | Example trigger
L1    | Default Hypothesis strategies are sufficient to find the bug | Integer boundary overflow (uint16 sign error at [32768, 65535])
L2    | A non-default, documentation-derived strategy is required    | Size thresholds, specific parameter combinations
L3    | Cross-function state sequences or algebraic property chains  | FSM callback ordering, precision thresholds, protocol conformance
L4    | External specification or engineering knowledge required     | DER/OID encoding standards, RFC compliance checks
Each problem directory includes official API documentation, existing unit tests (all passing on the buggy library), and hidden ground-truth property tests for validation.

Get Started

Requirements

  • Python 3.12
  • uv package manager
  • Docker (for isolated evaluation environments)
  • An LLM API key — Anthropic, OpenRouter, or compatible

1 — Clone and install dependencies

# Clone the repository
git clone https://github.com/ElliotXinqiWang/pbt-bench.git
cd pbt-bench

# Create a Python 3.12 virtual environment
uv venv .venv --python 3.12

# Install the evaluation framework (vendored OpenHands SDK)
uv pip install \
    vendor/software-agent-sdk/openhands-sdk \
    vendor/software-agent-sdk/openhands-tools \
    vendor/software-agent-sdk/openhands-workspace \
    vendor/software-agent-sdk/openhands-agent-server

2 — Configure your LLM

# eval/llm_config.json
{
  "model": "anthropic/claude-sonnet-4-6",
  "api_key": "your-api-key-here"
}

OpenRouter and local model configs are also supported — see llm_configs/ in the repo for examples.

3 — Run the benchmark

# Edit MODE, MAX_WORKERS, N_LIMIT at the top of run_eval.sh
bash run_eval.sh

# Or run a single track directly:
.venv/bin/python3 eval/run_pbt.py eval/llm_config.json \
    --n-limit 5 --max-iterations 40 --note my_first_run

Output

Results are written to eval_outputs/ as streaming JSONL — crash-safe and resumable. Each run includes per-problem agent traces, test files written by the agent, and a summary of F→P counts per bug.
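Because the output is one JSON object per line, partial runs can be summarized at any point. The field names below are illustrative (the real schema is defined by the harness), but the reading pattern is the same:

```python
# Sketch: summarizing a streaming JSONL results file.
# Field names are illustrative; the real schema is defined by the harness.
import io
import json

sample = io.StringIO(
    '{"problem": "pypdf_01", "bugs_found": 2, "bugs_total": 3}\n'
    '{"problem": "bidict_02", "bugs_found": 1, "bugs_total": 1}\n'
)

records = [json.loads(line) for line in sample if line.strip()]
recall = sum(r["bugs_found"] for r in records) / sum(r["bugs_total"] for r in records)
print(f"aggregate bug recall: {recall:.2f}")  # 0.75
```

Since each line is independent, a crashed run leaves a valid prefix that can be summarized or resumed without re-parsing a monolithic JSON document.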