Claude Vs Codex — Claude Learning Daily

A developer built a custom benchmark to compare Claude and Codex using their actual workflow rather than relying on potentially gamed published benchmarks. The benchmark tests four key stages of professional development work and uses an LLM judge to grade outputs against a rubric. Results are tracked regularly to spot performance trends and potential behind-the-scenes changes to model behavior.

Detailed Analysis

A professional developer and agency owner has built a custom benchmark that evaluates Claude and OpenAI's Codex against the concrete demands of real-world software development, publishing results at ClaudeVsCodex.com. The author distrusts published benchmarks, which they see as frequently subject to provider manipulation, harness updates, and model-level changes that obscure meaningful comparisons, and so built a four-stage evaluation pipeline modeled directly on their own daily development process. A blind judge LLM grades outputs against a defined rubric, which is intended to take human bias out of the scoring. The author plans to run the benchmark semi-regularly and track longitudinal trends, including the possibility of undisclosed model degradation or throttling by providers.
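The blind, rubric-based grading step described above can be sketched roughly as follows. This is a minimal illustration and not the author's harness: the rubric criteria and weights, the `blind_grade` function, and the `stub_judge` stand-in for a real LLM call are all hypothetical, since the source does not publish its rubric or stage definitions.

```python
import random
from typing import Callable, Dict

# Hypothetical rubric (criterion -> weight); the source does not publish its own.
RUBRIC: Dict[str, float] = {
    "correctness": 0.5,
    "instruction_following": 0.3,
    "readability": 0.2,
}

# A judge maps (output_text, criterion) to a score in [0, 1].
Judge = Callable[[str, str], float]


def blind_grade(outputs: Dict[str, str], judge: Judge, rng: random.Random) -> Dict[str, float]:
    """Score each tool's output while hiding the tool name from the judge."""
    tools = list(outputs)
    rng.shuffle(tools)  # randomize order so position does not hint at identity
    scores: Dict[str, float] = {}
    for tool in tools:
        text = outputs[tool]  # the judge receives only the text, never the tool name
        scores[tool] = sum(weight * judge(text, criterion)
                           for criterion, weight in RUBRIC.items())
    return scores


def stub_judge(text: str, criterion: str) -> float:
    """Stand-in for an LLM call (assumption): favors outputs that define a function."""
    return 1.0 if "def " in text else 0.5


# Illustrative inputs only; these are not real model outputs.
scores = blind_grade(
    {"claude": "def add(a, b): return a + b", "codex": "add = lambda a, b: a + b"},
    stub_judge,
    random.Random(0),
)
```

In a real harness the stub would be replaced by a call to the judge model, and the same procedure would run once per pipeline stage.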

The effort reflects a growing and well-founded skepticism within the developer community toward official benchmark figures. Both Anthropic and OpenAI publish benchmark results that, while technically accurate, are carefully curated to showcase each model's strongest domains. Claude's tools, for instance, are cited at 92% on HumanEval and 80.8% on SWE-bench Verified, metrics that emphasize structured, single-task code generation, while Codex leads on Terminal-Bench 2.0 with 77.3%, a benchmark more reflective of autonomous, system-wide task execution. Neither set of numbers captures what a working developer actually encounters: the messy, iterative, context-dependent work of building and maintaining software across a full project lifecycle.

The competition between Claude and Codex is less a like-for-like rivalry than a contest between two fundamentally different philosophies of AI-assisted development. Claude Code, built on Anthropic's Claude models, operates as a developer-guided copilot emphasizing planning-first workflows, strong instruction adherence, and thoroughness, typically consuming three to four times more tokens per task to produce cleaner, well-documented, production-ready output. Codex, powered by GPT-5 High and trained with reinforcement learning to behave as an autonomous software engineering agent, prioritizes speed, cost efficiency, and the ability to execute long-running tasks asynchronously without constant developer input. As a result, each tool occupies a distinct niche: Claude for high-fidelity, reasoning-intensive code generation under developer supervision, and Codex for delegated, parallelized execution at scale.

The developer's decision to build a workflow-specific benchmark is significant precisely because it exposes this niche differentiation. General-purpose benchmarks flatten architectural distinctions into a single performance score, obscuring the fact that the "better" model depends entirely on what a developer is trying to accomplish. For an agency developer running multiple projects simultaneously — as the article's author describes — the relative weight of instruction-following fidelity versus autonomous execution speed is not abstract; it directly affects productivity and code quality. Custom benchmarks grounded in real workflows represent a methodological correction to the incentive-distorted landscape of provider-published evaluations.

Broader trends in AI development make this kind of community-led evaluation increasingly important. As model providers iterate rapidly — sometimes pushing quiet updates that alter behavior without public announcement — the gap between advertised capability and experienced performance can widen without any formal disclosure. The author's plan to track historical results and flag potential behind-the-scenes nerfing reflects an emerging norm of developer-as-auditor, where practitioners treat AI tools with the same empirical skepticism applied to any other piece of production infrastructure. This mirrors a wider maturation in how the software development community is learning to interact with AI systems: not as monolithic, trustworthy authorities, but as dynamic, commercially motivated products requiring ongoing independent scrutiny.
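One simple way to flag the kind of quiet degradation described above is to compare each new run against a rolling baseline of earlier runs. The sketch below is an assumption about how such tracking might work, not the author's method: `flag_regressions`, the window size, the drop threshold, and the sample scores are all hypothetical.

```python
from statistics import mean
from typing import List


def flag_regressions(history: List[float], window: int = 3, drop: float = 0.1) -> List[int]:
    """Return indices of runs scoring more than `drop` below the mean of the prior `window` runs."""
    flagged: List[int] = []
    for i in range(window, len(history)):
        baseline = mean(history[i - window:i])  # average of the preceding runs
        if baseline - history[i] > drop:
            flagged.append(i)
    return flagged


# Made-up rubric scores for one tool across six benchmark runs.
runs = [0.80, 0.82, 0.81, 0.79, 0.65, 0.66]
suspicious = flag_regressions(runs)
```

A flagged run is only a prompt for closer inspection. Judge variance, harness changes, or a different task mix can produce the same drop as a change in the underlying model.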
