I (& Claude, ofc) built a site for measuring instruction files across models.

A developer created MarkedDown, a site that runs instruction files through a deterministic suite of 20 test cases in five diagnostic categories and measures how consistently each file performs across multiple AI models. When a file scores 90 or above, the platform activates a difficulty ratchet in which one model generates harder variants of passed test cases while another judges the responses. The tool also includes Drift Watch, which monitors for regressions when models update.

Detailed Analysis

A developer who describes themselves as a novice has built and launched MarkedDown, a web platform that scores and compares instruction files (commonly called system prompts or CLAUDE.md-style configuration files) across multiple large language models in a reproducible, deterministic framework. The project was directly inspired by an April 2026 arXiv paper (2604.01687) on self-evolving skills, and it addresses a well-documented but underserved problem in applied AI development: the same instruction file can produce dramatically different behavior depending on which model executes it. The platform runs each submitted file through a structured 20-case test suite divided across five diagnostic categories (Format, Priority, Edge Cases, Consistency, and Output) and assigns each model a score from 0 to 100. The spread between the highest and lowest scores across models yields a portability rating: tight spreads earn a "Highly Portable" badge, while wide spreads flag the file as model-specific.
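The per-model scoring described above can be sketched as follows. This is a minimal illustration, not MarkedDown's actual code: the function name `score_run`, the equal weighting of every test case, and the even split of four cases per category are all assumptions, since the source only states that there are 20 cases, five categories, and a 0 to 100 score.

```python
from collections import defaultdict

# The five diagnostic categories named in the article.
CATEGORIES = ("Format", "Priority", "Edge Cases", "Consistency", "Output")


def score_run(results):
    """Score one model's run (hypothetical helper).

    results: list of (category, passed) pairs, one per test case.
    Returns (overall score 0-100, per-category pass rate in percent).
    Equal weighting per case is an assumption, not a documented rule.
    """
    if not results:
        raise ValueError("no test results to score")
    per_category = defaultdict(list)
    for category, passed in results:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        per_category[category].append(bool(passed))
    total_passed = sum(sum(flags) for flags in per_category.values())
    overall = round(100 * total_passed / len(results))
    breakdown = {
        category: round(100 * sum(flags) / len(flags))
        for category, flags in per_category.items()
    }
    return overall, breakdown


# Hypothetical run: 4 cases per category, with two failures each
# in Priority and Edge Cases (16 of 20 cases pass).
example = []
for category in CATEGORIES:
    failures = 2 if category in ("Priority", "Edge Cases") else 0
    example += [(category, i >= failures) for i in range(4)]

overall, breakdown = score_run(example)
print(overall)    # 80
print(breakdown)
```

Keeping the per-category breakdown alongside the overall number is a design choice of this sketch; it makes it visible whether a low score comes from one category or is spread across all five.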

The platform's most technically notable feature is what the developer calls the "difficulty ratchet," a three-role co-evolutionary loop that activates when a model clears the base tier with a score of 90 or above. In this loop, a Student role executes the instruction file as a system prompt, a Tutor role generates progressively harder variants of test cases the Student already passed, and an Oracle role judges responses against the Tutor's rubric. The mechanism draws explicitly from the referenced self-evolving skills paper and is designed to probe depth rather than introduce out-of-distribution surprises — each harder case targets the same underlying capability as the one it replaces. Critically, the system is never cached and is fail-open, meaning bad outputs from the Tutor or Oracle do not penalize the file author. A companion feature called Drift Watch re-runs scoring whenever a model version changes, converting a one-time evaluation into an ongoing contract between the instruction file and its target models.
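The three-role loop can be sketched roughly as below. Everything here is illustrative: the `Case` type, the `ratchet` function, the round limit, and the stub roles are hypothetical names and choices, and a real system would call models where the stubs return canned values. The sketch only tries to show the two properties the article states: the loop starts at a base score of 90 or above, and a failure in the Tutor or Oracle is skipped rather than counted against the file. Nothing in it is memoized, so every call regenerates its variants.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Case:
    prompt: str
    rubric: str
    level: int = 0  # 0 = base tier; each Tutor step adds one


def ratchet(instruction_file: str, passed_cases: List[Case], base_score: int,
            student, tutor, oracle, threshold: int = 90,
            max_rounds: int = 3) -> Dict[str, int]:
    """Run harder variants of already-passed cases (hypothetical sketch)."""
    tally = {"passed": 0, "failed": 0, "skipped": 0}
    if base_score < threshold:
        return tally  # ratchet only activates at 90 or above
    for case in passed_cases:
        current = case
        for _ in range(max_rounds):
            try:
                harder = tutor(current)  # same capability, harder variant
            except Exception:
                tally["skipped"] += 1    # fail-open: bad Tutor output, no penalty
                break
            response = student(instruction_file, harder.prompt)
            try:
                verdict = oracle(harder, response)
            except Exception:
                tally["skipped"] += 1    # fail-open: bad Oracle output, no penalty
                break
            if not verdict:
                tally["failed"] += 1
                break
            tally["passed"] += 1
            current = harder
    return tally


# Stub roles standing in for model calls (assumptions for the demo only).
def demo_tutor(case: Case) -> Case:
    if case.level >= 2:
        raise ValueError("malformed variant")  # simulate a bad Tutor output
    return Case(case.prompt + " (harder)", case.rubric, case.level + 1)


def demo_student(instruction_file: str, prompt: str) -> str:
    return f"OK: {prompt}"


def demo_oracle(case: Case, response: str) -> bool:
    return response.startswith(case.rubric)


cases = [Case(prompt="Rewrite in a formal voice", rubric="OK")]
tally = ratchet("You are a tone matcher.", cases, 95,
                demo_student, demo_tutor, demo_oracle)
print(tally)  # {'passed': 2, 'failed': 0, 'skipped': 1}
```

In the demo, the Student clears two harder levels, then the Tutor stub fails on the third round; that round lands in `skipped`, not `failed`, which is the fail-open behavior.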

The results the developer shares illustrate the problem concretely. In a test of a "Tone Matcher" writing skill, an instruction file directing a model to rewrite text to match a given voice, GPT-4o mini, Gemma 4 31B, Qwen3 235B, and MiniMax M2.7 all scored 100, while GLM-5.1 scored 60 and Claude Haiku 4.5 scored 40, producing a spread of 60 points. The developer's diagnosis is pointed: the instruction file leaned on structural, literal cues that GPT-4o mini and Gemma followed precisely, while Claude's architecture or training disposition led it to try to "improve" the output rather than obey the contract. This finding is consistent with a commonly reported tendency of Claude models to let helpfulness heuristics override strict instruction compliance in ambiguous cases, a tendency this analysis attributes to Anthropic's constitutional training emphasis on being genuinely helpful rather than merely compliant.
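The spread arithmetic for the Tone Matcher run can be reproduced in a few lines. The scores are the ones reported above; the cutoffs `tight=10` and `wide=30` and the middle label "Mixed" are assumptions made for illustration, since the source does not say where MarkedDown draws its badge thresholds.

```python
# Scores reported for the "Tone Matcher" instruction file.
scores = {
    "GPT-4o mini": 100,
    "Gemma 4 31B": 100,
    "Qwen3 235B": 100,
    "MiniMax M2.7": 100,
    "GLM-5.1": 60,
    "Claude Haiku 4.5": 40,
}


def portability(model_scores, tight=10, wide=30):
    """Return (spread, label). Thresholds are hypothetical, not MarkedDown's."""
    if not model_scores:
        raise ValueError("need at least one model score")
    spread = max(model_scores.values()) - min(model_scores.values())
    if spread <= tight:
        label = "Highly Portable"
    elif spread >= wide:
        label = "Model-Specific"
    else:
        label = "Mixed"  # hypothetical middle band
    return spread, label


spread, label = portability(scores)
print(spread, label)  # 60 Model-Specific
```

Note that the spread depends only on the best and worst model, so four perfect scores do not offset a single low one.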

This project sits at the intersection of several converging trends in LLM evaluation and deployment infrastructure. As instruction files have proliferated — from OpenAI's system prompt hierarchies described in its 2025 Model Spec to GitHub Copilot's `.github/copilot-instructions.md` conventions to Anthropic's own CLAUDE.md tooling — the absence of cross-model portability testing has become a genuine operational liability for developers building on top of multiple providers. Academic efforts like ManyIFEval and StyleMBPP have begun formalizing multi-instruction evaluation, and MarkedDown represents an applied, user-facing instantiation of that same need. The difficulty ratchet mechanism in particular reflects a broader move in evaluation methodology toward adaptive, generative benchmarks that resist saturation, rather than static test suites that models can effectively memorize or overfit through training data contamination.

The project's most consequential long-term contribution may be the Drift Watch feature, which reframes instruction file evaluation not as a one-time audit but as a continuous monitoring contract. As model providers push updates — sometimes silently altering instruction-following behavior between versions — developers who depend on stable, predictable outputs from instruction files currently have no systematic way to detect regressions. MarkedDown's architecture addresses this gap directly, and if the platform attracts a community of published instruction files as the developer hopes, it could become a meaningful empirical record of how instruction compliance evolves across model generations. The Claude Haiku 4.5 result in particular underscores why such infrastructure matters: a model widely assumed to be a strong instruction follower scored last in a clean comparative run, not because of a bug, but because of a design philosophy that prioritizes generative judgment over strict literal compliance.
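A regression check in the spirit of Drift Watch might look like the sketch below. The function name `drift_watch`, the `(version, score)` baseline layout, the `rescore` callback, and the 5-point tolerance are all assumptions; the source only says that scoring is re-run when a model version changes. The model names and scores in the demo are placeholders, not measured results.

```python
def drift_watch(baseline, live_versions, rescore, tolerance=5):
    """Re-score only models whose version changed (hypothetical sketch).

    baseline:      dict of model -> (version, score) from the last run
    live_versions: dict of model -> version currently being served
    rescore:       callable (model, version) -> new score 0-100
    Returns a list of alerts for drops larger than `tolerance` points.
    """
    alerts = []
    for model, (old_version, old_score) in baseline.items():
        new_version = live_versions.get(model, old_version)
        if new_version == old_version:
            continue  # no version change, so no re-run
        new_score = rescore(model, new_version)
        drop = old_score - new_score
        if drop > tolerance:
            alerts.append({
                "model": model,
                "from": old_version,
                "to": new_version,
                "drop": drop,
            })
    return alerts


# Placeholder data: model-a changed version, model-b did not.
baseline = {"model-a": ("v1", 95), "model-b": ("v3", 90)}
live_versions = {"model-a": "v2", "model-b": "v3"}
rescored = []


def fake_rescore(model, version):
    rescored.append(model)  # record which models were actually re-run
    return 70               # canned score standing in for a real test run


alerts = drift_watch(baseline, live_versions, fake_rescore)
print(alerts)    # [{'model': 'model-a', 'from': 'v1', 'to': 'v2', 'drop': 25}]
print(rescored)  # ['model-a']
```

Triggering on version changes rather than on a timer keeps the number of re-runs proportional to provider updates, though it relies on providers exposing a version identifier, which silent updates may not do.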


