Nelson v2.2.3 shipped, and a benchmark I built ranked it 3rd out of 13 agent/harness/skill setups on a discrete-event sim task

The author released Nelson v2.2.3, a multi-agent coordination skill for Claude Code, and built a benchmark that tests 13 model, harness, and skill configurations on a discrete-event simulation task. Nelson ranked third with a quality score of 95, behind ouroboros-max-thinking (97) and plan-mode (96), both running Claude Opus with extended thinking. The results suggest that the underlying model, and whether thinking was enabled, were the dominant factors in quality, while skill choice accounted for a much smaller difference.

Detailed Analysis

Nelson v2.2.3, a multi-agent coordination skill built for Claude Code, shipped alongside a benchmark built by the same developer. The benchmark evaluates 13 distinct combinations of model, CLI, and skill framework on a discrete-event simulation task that models synthetic mine throughput. It is publicly accessible at simulation-bench.fly.dev and is a deliberate attempt to replace qualitative, "vibes-based" comparisons of agent frameworks with reproducible numerical scores. Nelson placed third of the 13 configurations with a quality score of 95, behind ouroboros-max-thinking (97) and Claude Code's native plan-mode (96), and ahead of superpowers-max-thinking (94), vanilla max-thinking (92), and the Sonnet-based vanilla configurations (85). The framework uses a Royal Navy hierarchy metaphor (admiral, captains, ships, and crew) to coordinate parallel agents and keep them from overwriting each other's work, a problem that becomes acutely visible at scale.

The benchmark's most significant finding is not Nelson's placement but what it indicates about the drivers of quality. The developer expected curated skill frameworks to open a meaningful gap over vanilla baselines; they did not. The dominant variable was model selection paired with extended thinking mode. Five configurations, all running on Claude's opus-4-7 with thinking enabled, scored between 92 and 97, within five points of each other, which suggests that within that tier framework choice is a matter of preference rather than a performance decision. The "actual cliff," as the developer describes it, sits between opus-with-thinking and everything else, a gap large enough that no skill layer applied to a weaker model or a thinking-disabled configuration could close it.
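The clustering argument is simple arithmetic on the scores quoted in this summary, and a minimal sketch makes it concrete. The dictionary keys are shortened labels chosen here for illustration, and `next_best` is only the highest score this summary quotes for a configuration outside the opus-with-thinking tier; the full leaderboard may contain scores not listed here.

```python
# Scores as quoted in this summary; labels are illustrative shorthand.
opus_thinking = {
    "ouroboros-max-thinking": 97,
    "plan-mode": 96,
    "nelson": 95,
    "superpowers-max-thinking": 94,
    "vanilla-max-thinking": 92,
}

# Highest score quoted here for a configuration outside that tier
# (Sonnet vanilla and GPT-5 via Codex). Assumption: nothing unlisted sits between.
next_best = 85

# Spread inside the tier versus the drop to the next quoted score.
spread = max(opus_thinking.values()) - min(opus_thinking.values())
cliff = min(opus_thinking.values()) - next_best

print(f"spread within opus-with-thinking tier: {spread} points")
print(f"gap from bottom of that tier to next quoted score: {cliff} points")
```

On these numbers, the drop out of the tier is larger than the entire spread inside it, which is the shape the developer calls a cliff.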

The second-place finish of Claude Code's built-in plan-mode, a stock configuration with no added skill, is particularly notable. It directly challenges a common assumption in the agent tooling community: that purpose-built orchestration layers and carefully curated skill sets provide substantial, reliable uplift over baseline model capabilities. The result instead implies that, for at least this class of structured reasoning task, the underlying model and its reasoning process do the heavy lifting, and framework complexity adds modest incremental value rather than transformative improvement.

The inclusion of competing frontier models adds a cross-vendor comparison. GPT-5 running through Codex scored 85, on par with Sonnet's vanilla-max configuration, while Gemini 3.1 Pro ranged from 67 to 81 depending on the wrapper. The developer attributes that 14-point spread to poorly tuned tooling rather than intrinsic model capability, and flags Gemini as potentially undersold in these results. This variance across wrappers for the same underlying model reinforces the benchmark's central lesson from a different angle: the harness and skill layer can depress performance when misconfigured, even if they rarely raise it dramatically when well tuned.

The benchmark carries honest, prominent caveats — a single task type, a rubric authored by the same person who built Nelson, and no consolidated cost-adjusted ranking despite token usage and execution time being tracked separately. The developer notes that on a per-token basis the quality ranking would likely shift, particularly disadvantaging ouroboros, which leads on quality but has not yet been fully costed. The project represents an early but methodologically self-aware attempt to inject empiricism into a space where agent framework comparisons have historically been driven by anecdote, and its open invitation for additional task submissions and configuration suggestions positions it as an evolving resource rather than a definitive verdict.
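To illustrate why a per-token view could reorder the leaderboard, here is one hypothetical way to fold token usage into a single figure: quality points per million tokens. This metric is an assumption made for illustration, not something the benchmark defines, and the `Run` type, the function names, and the `config_a`/`config_b` inputs are placeholders rather than benchmark results.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One benchmark run (hypothetical structure, not the benchmark's schema)."""
    name: str
    quality: float  # rubric score
    tokens: int     # total tokens consumed by the run


def quality_per_million_tokens(run: Run) -> float:
    """Assumed cost-adjusted metric: quality points per million tokens."""
    if run.tokens <= 0:
        raise ValueError("tokens must be positive")
    return run.quality / (run.tokens / 1_000_000)


def rank_by_cost_adjusted(runs: list[Run]) -> list[Run]:
    """Sort runs from most to least quality per million tokens."""
    return sorted(runs, key=quality_per_million_tokens, reverse=True)


# Placeholder inputs only: these are NOT measured benchmark figures.
runs = [
    Run("config_a", quality=97, tokens=4_000_000),
    Run("config_b", quality=95, tokens=2_000_000),
]
ranked = [r.name for r in rank_by_cost_adjusted(runs)]
print(ranked)
```

With these placeholder inputs, the configuration with the lower raw quality score ranks first once tokens are taken into account, which is the kind of shift the developer anticipates.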
