PolyRange: Contamination-resistant offensive-AI benchmark for web targets

PolyRange is a cyber-AI evaluation benchmark designed to address two limitations in existing approaches: they either rely on static targets that enter training corpora or test against undefined defensive infrastructure. The framework uses a large language model to generate fresh environments for each deployment and incorporates active defense tiers to approximate real-world conditions. It covers 84 WSTG-derived vulnerability classes across all 12 OWASP testing categories, runs on real backends, and is released as MIT-licensed open-source software.

Detailed Analysis

PolyRange is a methodological intervention in offensive-AI benchmarking, developed independently by the CEO of Aether AI in response to a widely acknowledged but underaddressed problem in cyber-AI evaluation. The core premise is that existing benchmarks, whether CTF-style platforms like DVWA and NYU CTF Bench or bug-bounty-style frameworks like XBOW, suffer from a fundamental contamination problem: because their targets are static, their contents are likely absorbed into the training corpora of the very models being evaluated, which makes benchmark scores increasingly unreliable as indicators of real-world capability. Version 1.0 ships with 84 vulnerability classes derived from the OWASP Web Security Testing Guide (WSTG) across all 12 OWASP categories. Targets run on real backend infrastructure, including Postgres, PHP, Jinja2, and shell environments, and two defense tiers approximate the active-defender conditions absent from most current evaluation ranges.

The significance of PolyRange lies in its structural approach to contamination resistance. Rather than maintaining a fixed set of challenge instances, the framework generates fresh targets dynamically via a researcher-specified LLM at evaluation time, meaning no static artifact exists that future training pipelines can consume. This directly addresses a concern Anthropic raised in its Claude Mythos system card, in which the company acknowledged that its own published evaluations risk contributing to benchmark saturation — effectively accelerating the obsolescence of the measurements they produce. By making the generative model itself a parameter, PolyRange also aligns with OpenAI's stated preference for newly constructed private tasks in third-party evaluations and with guidance from DeepMind's Nicholas Carlini advocating against reliance on standardized public benchmarks.
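
The idea of treating the generative model as a parameter can be illustrated with a minimal sketch. This is not PolyRange's actual API: the names `Generator`, `TargetSpec`, `build_target`, and `stub_llm` are hypothetical, the prompt wording is invented, and the stub stands in for whichever LLM a researcher would supply. The vulnerability identifier is used only as an example label. The sketch shows one way each evaluation run could derive its target from per-run randomness, so that no fixed artifact is produced twice.

```python
import hashlib
import secrets
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function that maps a prompt to generated text.
# A real harness would wrap the researcher's chosen LLM here.
Generator = Callable[[str], str]


@dataclass(frozen=True)
class TargetSpec:
    vuln_class: str  # example label for a WSTG-style vulnerability class
    nonce: str       # per-run randomness, drawn fresh for every target
    source: str      # generated application code, treated as opaque text

    def fingerprint(self) -> str:
        # Hash of the generated source, useful for checking that
        # two runs did not produce the same artifact.
        return hashlib.sha256(self.source.encode()).hexdigest()


def build_target(vuln_class: str, generate: Generator) -> TargetSpec:
    # A fresh nonce is mixed into the prompt so each run asks the
    # generator for a different variation of the same class.
    nonce = secrets.token_hex(8)
    prompt = (
        f"Generate a vulnerable web app for {vuln_class}; "
        f"variation seed {nonce}"
    )
    return TargetSpec(vuln_class, nonce, generate(prompt))


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call (assumption): echoes the prompt
    # so that the output varies with the nonce.
    return f"# app generated from: {prompt}"


first = build_target("WSTG-INPV-05", stub_llm)
second = build_target("WSTG-INPV-05", stub_llm)
print(first.fingerprint() != second.fingerprint())
```

Because the generator is passed in as an ordinary callable, swapping models requires no change to the harness itself, which is the property the paragraph above attributes to making the model a parameter.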

The framework also attempts to close what multiple labs have identified as a second gap: the absence of defensive realism. UK AISI's Folkerts et al. paper on multi-step cyber evaluation explicitly noted that its ranges featured no active defenders and were static in nature. PolyRange's two defense tiers are designed to introduce at least approximations of defensive tooling and infrastructure conditions, though the author is candid that a full empirical paper with publishable sample sizes depends on partnership funding not yet secured. The methodological contribution — the framework itself, released under MIT license — is thus currently separable from the empirical results it is intended to eventually support.

Viewed against the broader trajectory of AI safety and capability evaluation, PolyRange reflects a maturing recognition across major labs that benchmark infrastructure has not kept pace with model capability growth. The field is experiencing what might be called an evaluation debt: widely used benchmarks are becoming saturated or contaminated faster than replacements are being designed. The author's disclosure that this work is independent of Aether AI's commercial roadmap and released openly is notable, as it positions PolyRange as a public-goods contribution to evaluation methodology rather than a proprietary competitive tool — a distinction that carries credibility given the commercial incentives that could otherwise shape such a framework. The two-bucket entropy framing the author references, which attempts to separate exploit-recall axes from cosmetic or realism axes in benchmark design, represents a conceptual contribution to evaluation theory that, if validated, could influence how the broader research community structures offensive AI assessments going forward.
