Detailed Analysis
Probus, an open-source AI-powered vulnerability scanner built by a solo developer and released under the Apache 2.0 license, represents a notable entry in the emerging field of agentic security tooling. The system employs a three-agent pipeline — an Analyst, a Researcher, and a QA agent — each operating in isolated query sessions through the Claude Agent SDK with filesystem sandboxes scoped to the target repository. The Analyst performs a single LLM call to identify between 50 and 500 files worth deep-scanning, selecting on the basis of entry points, third-party surface area, and dangerous code sinks. The Researcher then walks call chains per file and generates raw findings, while the QA agent independently re-reads the code to reject any finding lacking a demonstrable attack vector. Critically, the QA agent is deliberately kept ignorant of the Researcher's reasoning — a design choice the developer identifies as the primary mechanism for reducing false positives.
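The three-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Probus's actual code: the `Finding` type, the `run_pipeline` function, and the three toy agents are hypothetical stand-ins for the LLM-backed agents, and no Claude Agent SDK calls are shown.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    """Hypothetical record of one raw finding produced by the Researcher."""
    file: str
    summary: str


# Hypothetical agent interfaces. In a real system each would wrap an
# isolated LLM session rather than a plain function.
Analyst = Callable[[list[str]], list[str]]    # all files -> files to deep-scan
Researcher = Callable[[str], list[Finding]]   # one file -> raw findings
QA = Callable[[Finding], bool]                # finding -> has an attack vector?


def run_pipeline(files: list[str], analyst: Analyst,
                 researcher: Researcher, qa: QA) -> list[Finding]:
    """Run selection, per-file research, and independent verification."""
    selected = analyst(files)                 # single selection pass
    confirmed = []
    for path in selected:
        for finding in researcher(path):      # per-file deep scan
            if qa(finding):                   # keep only verified findings
                confirmed.append(finding)
    return confirmed


# Toy stand-ins so the sketch runs without any model calls.
def toy_analyst(files: list[str]) -> list[str]:
    return [f for f in files if f.endswith(".py")]


def toy_researcher(path: str) -> list[Finding]:
    if "handler" in path:
        return [Finding(path, "unsanitized input reaches a shell call")]
    return []


def toy_qa(finding: Finding) -> bool:
    # Pretend QA rejects findings in test code as lacking an attack vector.
    return not finding.file.startswith("tests/")


result = run_pipeline(
    ["README.md", "api/handler.py", "tests/handler_test.py", "util.py"],
    toy_analyst, toy_researcher, toy_qa,
)
```

With the toy agents, only the finding in `api/handler.py` survives: the README is never selected, `util.py` yields no findings, and the test file's finding is rejected at the QA stage.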
The practical validation of Probus is substantial for a tool at this stage. The developer ran it against widely deployed open-source projects and submitted pull requests documenting real vulnerabilities: password-reset JWT exposure in n8n, multiple injection and schema-bypass issues in the Vercel AI SDK, a NoSQL injection in LangGraph.js's MongoDBSaver, a path traversal in browser-use, and a cluster of SSRF, file-read, and unbounded-body vulnerabilities in Haystack. Several of these PRs were merged, which gives the tool's signal quality a degree of independent verification. The range of vulnerability classes discovered (injection, traversal, SSRF, prototype pollution, schema bypass) suggests the Researcher agent can reason across meaningfully distinct attack surfaces rather than pattern-matching on a narrow set of known vulnerability signatures.
The architectural decision to isolate the QA agent from the Researcher's reasoning chain is the most technically interesting aspect of the system. It directly addresses a well-documented failure mode of LLM-based reasoning: sycophantic agreement with prior context. By the developer's account, when the QA agent had access to the Researcher's rationale, it tended to validate findings uncritically, inflating the false positive rate. Stripping that context forces the QA agent to derive its verdict solely from the code itself, so it functions more like a blind peer review than a confirmation pass. This pattern of using adversarial or context-isolated agents as verifiers is gaining traction in multi-agent system design, and it reflects a broader recognition that LLM reliability often depends less on model capability than on prompt architecture and information compartmentalization.
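One way to enforce this kind of isolation is to project each finding onto an explicit allowlist of fields before it is handed to the verifier. The sketch below is an assumption about how such a boundary could look, not a description of Probus's internals: `RawFinding`, `QA_VISIBLE_FIELDS`, and `build_qa_input` are hypothetical names, and which fields the real QA agent receives is not stated in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawFinding:
    """Hypothetical Researcher output, including its private reasoning."""
    file: str
    line: int
    claim: str
    rationale: str  # the Researcher's reasoning; must never reach QA


# Assumed allowlist: only these fields cross the boundary to the QA agent.
QA_VISIBLE_FIELDS = ("file", "line", "claim")


def build_qa_input(finding: RawFinding) -> dict:
    """Return only the allowlisted fields of a finding for the QA agent."""
    return {name: getattr(finding, name) for name in QA_VISIBLE_FIELDS}


raw = RawFinding(
    file="src/auth.py",
    line=42,
    claim="reset token is returned in the HTTP response body",
    rationale="The handler builds the token and then serializes the whole "
              "user object, so the token field is probably exposed.",
)
qa_input = build_qa_input(raw)
```

An allowlist is used here rather than a denylist so that any field added to the finding later stays hidden from the verifier by default.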
From a cost and model-selection standpoint, the developer's figures reveal a meaningful economic gradient across providers: approximately $0.50 per file using Qwen 3 and DeepSeek v4 Pro via OpenRouter, roughly $1.25 with OpenAI, and $5.00 with Anthropic. Probus is explicitly tuned for open or cost-efficient frontier models rather than premium API endpoints. This positions the tool within a growing category of agentic workflows where the economics of running long, multi-step LLM pipelines at scale push developers toward model diversity and cost arbitrage rather than reliance on a single provider. The Claude Agent SDK is used for session and sandbox management, but the inference workload is distributed across cheaper alternatives, a pattern likely to become increasingly common as agent orchestration frameworks mature.
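To make the gradient concrete, the per-file figures can be combined with the 50 to 500 files the Analyst selects. The calculation below is a rough sketch that assumes cost scales linearly with the number of selected files and that the quoted per-file figure applies to every one of them; the dictionary keys and the `scan_cost_range` helper are illustrative names.

```python
# Approximate per-file costs in USD, as reported by the developer.
COST_PER_FILE_USD = {
    "open_models_via_openrouter": 0.50,
    "openai": 1.25,
    "anthropic": 5.00,
}


def scan_cost_range(cost_per_file: float,
                    min_files: int = 50,
                    max_files: int = 500) -> tuple[float, float]:
    """Return (low, high) scan cost, assuming linear scaling per file."""
    return (cost_per_file * min_files, cost_per_file * max_files)


estimates = {
    provider: scan_cost_range(cost)
    for provider, cost in COST_PER_FILE_USD.items()
}

for provider, (low, high) in estimates.items():
    print(f"{provider}: ${low:,.2f} to ${high:,.2f}")
```

Under these assumptions a full scan would cost roughly $25 to $250 with the open models, $62.50 to $625 with OpenAI, and $250 to $2,500 with Anthropic, a tenfold spread between the cheapest and most expensive options.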
The tool's acknowledged limitations point toward the next generation of challenges in automated security analysis. The single-LLM-call Analyst step is noted as a bottleneck on large monorepos, with the developer considering a hierarchical replacement — a problem that maps directly to the broader difficulty of context management and prioritization in long-horizon agentic tasks. The desire to benchmark against a vulhub-style corpus similarly reflects the field's need for standardized evaluation infrastructure, which remains underdeveloped relative to the proliferation of AI security tooling. Probus's public release invites the kind of community stress-testing that could accelerate both the tool's maturation and the development of shared benchmarks for AI-driven static analysis more broadly.
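The source does not say what form a hierarchical Analyst would take. One plausible shape, offered purely as an assumption, is a two-level selection: rank top-level directories first, then rank files only within the directories that made the cut, so no single call has to see the whole monorepo. In the sketch below, `hierarchical_select` and both toy scorers are hypothetical; in practice each scorer would be a bounded-context model call rather than a keyword count.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Callable

DirScorer = Callable[[str, list[str]], float]
FileScorer = Callable[[str], float]


def hierarchical_select(files: list[str], score_dir: DirScorer,
                        score_file: FileScorer,
                        top_dirs: int = 2, per_dir: int = 2) -> list[str]:
    """Pick the best directories first, then the best files inside each."""
    # Level 1: group files by top-level directory ("." for root files).
    groups: dict[str, list[str]] = defaultdict(list)
    for path in files:
        parts = PurePosixPath(path).parts
        groups[parts[0] if len(parts) > 1 else "."].append(path)

    ranked_dirs = sorted(
        groups, key=lambda d: score_dir(d, groups[d]), reverse=True
    )[:top_dirs]

    # Level 2: rank files only within the surviving directories.
    selected: list[str] = []
    for directory in ranked_dirs:
        ranked = sorted(groups[directory], key=score_file, reverse=True)
        selected.extend(ranked[:per_dir])
    return selected


# Toy scorers: count risky-sounding keywords in the path.
RISKY = ("upload", "exec", "routes")


def toy_score_file(path: str) -> float:
    return sum(keyword in path for keyword in RISKY)


def toy_score_dir(directory: str, members: list[str]) -> float:
    return sum(toy_score_file(m) for m in members)


repo = [
    "api/routes.py", "api/upload.py", "api/models.py",
    "docs/intro.md", "docs/api.md",
    "worker/exec.py", "worker/queue.py",
    "setup.py",
]
picked = hierarchical_select(repo, toy_score_dir, toy_score_file)
```

The trade-off in this design is that a risky file inside a low-ranked directory is never examined, so the directory-level scorer becomes the new place where recall can be lost.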