Benchmark Methodology — AuditAgents

Result Definitions

Each benchmark is evaluated against a fixed set of expected detections. The overall result is assigned based on the quality, completeness, and accuracy of the audit report — not just whether the vulnerability name appeared.

✓ PASS Full Pass

The benchmark's primary vulnerability class was correctly identified, accurately prioritized as the most severe finding, and accompanied by a reasonable exploit explanation and remediation guidance. The report demonstrates genuine understanding of the attack vector — not surface-level pattern matching.

⚠ PARTIAL Partial Pass

The benchmark's primary vulnerability class was identified, but the report contained notable issues: incorrect severity prioritization, excessive focus on secondary or incomplete-implementation findings, significant false positives, or an inaccurate exploit path. Detection occurred, but these quality issues reduced the report's usefulness.

✗ FAIL Fail

The benchmark's primary vulnerability class was missed entirely, incorrectly classified (e.g., labeled a different vulnerability type), or so inadequately explained that a developer could not act on the finding. A FAIL represents a genuine security blind spot.

How Benchmarks Are Run

1. Contract Design

A minimal vulnerable contract is written to reproduce the target vulnerability class. No revealing names, comments, or annotations that hint at the vulnerability are used — the contract must look like natural code.

2. Blind Submission

The contract is submitted to AuditAgents through the standard user interface — the same path any paying customer would take. No special configurations, prompts, or model hints are applied.

3. Report Review

The generated PDF report is reviewed against the benchmark's expected detection criteria. Each criterion is independently evaluated as PASS or PARTIAL.

4. Overall Rating

A PASS requires that all criteria are met. A PARTIAL means the primary vulnerability was detected but the report had quality issues. A FAIL means the primary vulnerability was not detected. Scores are not rounded up.

5. Publication

Results are published as-is, including PARTIAL and FAIL outcomes. No benchmarks are retried, cherry-picked, or excluded based on result. The methodology commits to publishing all runs.
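The rating rule in step 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tooling AuditAgents uses: the names `Rating` and `overall_rating` are hypothetical, and each criterion is simplified to a boolean (fully met or not).

```python
from enum import Enum


class Rating(Enum):
    PASS = "PASS"
    PARTIAL = "PARTIAL"
    FAIL = "FAIL"


def overall_rating(primary_detected: bool, criteria_met: list[bool]) -> Rating:
    """Combine per-criterion outcomes into one overall rating.

    Assumption: `criteria_met` holds one boolean per expected detection
    criterion, True only when that criterion was fully met.
    """
    # Missing the primary vulnerability class is always a FAIL.
    if not primary_detected:
        return Rating.FAIL
    # A PASS requires every criterion to be met; nothing is rounded up.
    if all(criteria_met):
        return Rating.PASS
    # Detected, but with report quality issues.
    return Rating.PARTIAL
```

For example, `overall_rating(True, [True, False])` yields `Rating.PARTIAL`: the vulnerability was found, but one criterion fell short, so the result is not promoted to a PASS.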

Dashboard Statistics

Detection Rate

The percentage of benchmarks in which the primary vulnerability class was identified. This counts both PASS and PARTIAL results.

(PASS + PARTIAL) / Total × 100

Clean Pass Rate

The percentage of benchmarks that received a full PASS rating. This is the stricter metric — it excludes PARTIAL results.

PASS / Total × 100
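Both formulas can be checked with a short Python sketch. The function name `dashboard_stats` and the sample results are hypothetical, chosen only to show the arithmetic; they are not actual benchmark outcomes.

```python
from collections import Counter


def dashboard_stats(results: list[str]) -> dict[str, float]:
    """Compute both dashboard percentages from a list of overall ratings.

    Assumption: each entry is one of "PASS", "PARTIAL", or "FAIL".
    """
    total = len(results)
    if total == 0:
        raise ValueError("no benchmark results to summarize")
    counts = Counter(results)
    return {
        # (PASS + PARTIAL) / Total × 100
        "detection_rate": (counts["PASS"] + counts["PARTIAL"]) / total * 100,
        # PASS / Total × 100
        "clean_pass_rate": counts["PASS"] / total * 100,
    }


# Hypothetical sample: two PASS, one PARTIAL, one FAIL.
sample = ["PASS", "PASS", "PARTIAL", "FAIL"]
stats = dashboard_stats(sample)
print(stats)  # {'detection_rate': 75.0, 'clean_pass_rate': 50.0}
```

With this sample the Detection Rate is 75% while the Clean Pass Rate is 50%, which shows how much of the gap between the two metrics comes from PARTIAL results.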

Principles

No cherry-picking. Every benchmark that is run is published. We do not select which contracts to benchmark based on expected outcome.

No hints. Contracts use neutral names, no revealing comments, and no artificially simplified structure. The test conditions reflect what a real user would submit.

Honest ratings. A PARTIAL is not a PASS. We report PARTIAL when detection occurred but report quality was meaningfully reduced. This distinction matters for real security decisions.

Evolving standard. As AuditAgents improves, benchmarks that previously received PARTIAL or FAIL ratings may be re-run. Both the original and new result will be documented.