How AuditAgents assigns PASS, PARTIAL, and FAIL ratings — and what those ratings mean for real-world audit quality.
Each benchmark is evaluated against a fixed set of expected detections. The overall result is assigned based on the quality, completeness, and accuracy of the audit report — not just whether the vulnerability name appeared.
A minimal vulnerable contract is written to reproduce the target vulnerability class. No revealing names, comments, or annotations that hint at the vulnerability are used — the contract must look like natural code.
The contract is submitted to AuditAgents through the standard user interface — the same path any paying customer would take. No special configurations, prompts, or model hints are applied.
The generated PDF report is reviewed against the benchmark's expected detection criteria. Each criterion is independently evaluated as PASS, PARTIAL, or FAIL.
A PASS requires that all criteria are met. A PARTIAL means the primary vulnerability was detected but the report has quality issues. A FAIL means the primary vulnerability was not detected. Scores are not rounded up: a report that misses any criterion is never rated PASS.
Results are published as-is, including PARTIAL and FAIL outcomes. No benchmarks are retried, cherry-picked, or excluded based on result. The methodology commits to publishing all runs.
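The rating rule in the steps above can be sketched in a few lines of Python. This is an illustration only, not AuditAgents' actual scoring code: the names `Rating` and `overall_rating`, and the split between a primary-detection flag and a list of remaining criteria, are assumptions made for the example.

```python
from enum import Enum


class Rating(Enum):
    PASS = "PASS"
    PARTIAL = "PARTIAL"
    FAIL = "FAIL"


def overall_rating(primary_detected: bool, other_criteria_met: list[bool]) -> Rating:
    """Hypothetical helper: combine per-criterion outcomes into one rating.

    Assumes the criteria can be split into the primary detection and the
    remaining report-quality criteria.
    """
    # Missing the primary vulnerability is always a FAIL,
    # regardless of how the rest of the report reads.
    if not primary_detected:
        return Rating.FAIL
    # PASS only when every remaining criterion is met; nothing is rounded up.
    if all(other_criteria_met):
        return Rating.PASS
    # Primary vulnerability detected, but at least one quality criterion missed.
    return Rating.PARTIAL
```

Under these assumptions, a report that detects the primary vulnerability but misses one of five quality criteria is rated PARTIAL, not PASS.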
No cherry-picking. Every benchmark that is run is published. We do not choose which contracts to benchmark based on the expected outcome.
No hints. Contracts use neutral names, no revealing comments, and no artificially simplified structure. The test conditions reflect what a real user would submit.
Honest ratings. A PARTIAL is not a PASS. We report PARTIAL when detection occurred but report quality was meaningfully reduced. This distinction matters for real security decisions.
Evolving standard. As AuditAgents improves, benchmarks that previously received PARTIAL or FAIL ratings may be re-run. Both the original and the new result will be documented.
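One way to keep both the original and the re-run result on record is an append-only history per benchmark. The sketch below assumes this design; the class `BenchmarkHistory`, its fields, and the run labels are hypothetical and are not taken from AuditAgents.

```python
from dataclasses import dataclass, field

VALID_RATINGS = {"PASS", "PARTIAL", "FAIL"}


@dataclass
class BenchmarkHistory:
    """Hypothetical append-only record of every run of one benchmark."""

    name: str
    runs: list[tuple[str, str]] = field(default_factory=list)  # (label, rating)

    def record(self, label: str, rating: str) -> None:
        # Results are only ever appended, so earlier outcomes stay visible.
        if rating not in VALID_RATINGS:
            raise ValueError(f"unknown rating: {rating!r}")
        self.runs.append((label, rating))

    def may_rerun(self) -> bool:
        # Only benchmarks whose latest result is PARTIAL or FAIL qualify.
        return bool(self.runs) and self.runs[-1][1] in {"PARTIAL", "FAIL"}
```

Because `record` never overwrites earlier entries, a benchmark that moves from PARTIAL to PASS still shows the original PARTIAL alongside the new result.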