Opus 4.7 says "strawperrry" has 3 p's — until you ask "how?"

Opus 4.7 demonstrated pattern-matching behavior by confidently claiming "strawperrry" contains 3 p's, then correcting itself to 1 p after being prompted to enumerate the letters individually. The model's initial response reflected tokenization blindness, where it matched the question to the familiar "strawberry" puzzle rather than performing actual counting. Automated research testing this phenomenon across multiple independent instances reveals these simple letter-counting questions can cause significant disagreement among separate model instances.

Detailed Analysis

Anthropic's Claude Opus 4.7, even when configured at maximum effort settings with extended context, demonstrates a persistent failure mode in letter-counting tasks: it confidently miscounts the "p" letters in the deliberately modified word "strawperrry." The word, which contains one "p" and four "r"s (s-t-r-a-w-p-e-r-r-r-y: one "r" in "straw" and three in "perrry"), was designed as a variation on the widely known "strawberry" letter-counting puzzle. When first asked how many p's the word contains, the model answered "3," an incorrect response delivered with confidence. Only when prompted with the follow-up question "how?" did the model perform an explicit letter-by-letter enumeration and arrive at the correct answer of 1.
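The letter-by-letter enumeration that the "how?" follow-up triggered can be sketched in a few lines of Python. This is only an illustration of the procedure, not a reproduction of the model's output; the helper names `enumerate_letters` and `count_letter` are made up for this example.

```python
from typing import List, Tuple


def enumerate_letters(word: str) -> List[Tuple[int, str]]:
    """Pair each character with its 1-based position in the word."""
    return list(enumerate(word, start=1))


def count_letter(word: str, letter: str) -> int:
    """Count a letter by walking the word one character at a time."""
    return sum(1 for _, ch in enumerate_letters(word) if ch == letter)


word = "strawperrry"

# Print every position so the count can be checked by eye.
for position, ch in enumerate_letters(word):
    marker = "  <-- p" if ch == "p" else ""
    print(f"{position:2d}: {ch}{marker}")

print(count_letter(word, "p"))  # prints 1
```

The only "p" sits at position 6, which is the answer the model reached once it spelled the word out.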

The core mechanism behind this failure is tokenization. Large language models like Claude do not process text as sequences of individual characters; instead, they encode text into subword tokens — chunks like "straw" or "berry" — which are then mapped to numerical representations for processing. This architecture enables sophisticated reasoning across vast domains but fundamentally undermines character-level tasks. The model's initial "3 p's" response appears to stem from pattern-matching: "strawperrry" visually and semantically resembles "strawberry," a word that has become culturally embedded in AI benchmarking discussions precisely because models frequently miscount its three r's. Opus 4.7 effectively substituted the familiar puzzle for the novel one, applying a cached association rather than performing genuine character analysis.
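To make the subword idea concrete, here is a toy greedy longest-match tokenizer. The vocabulary and token IDs are invented for illustration and are an assumption of this sketch; they are not Claude's actual tokenizer, whose splits for these words are not given in the source.

```python
from typing import Dict, List

# Hypothetical vocabulary: chunk -> made-up integer ID.
TOY_VOCAB: Dict[str, int] = {"straw": 101, "berry": 102, "per": 103, "rry": 104}


def tokenize(text: str, vocab: Dict[str, int]) -> List[str]:
    """Split text greedily into the longest chunks found in vocab."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No chunk matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("strawberry", TOY_VOCAB))   # ['straw', 'berry']
print(tokenize("strawperrry", TOY_VOCAB))  # ['straw', 'per', 'rry']
```

Under this toy vocabulary the model-side input is a short list of chunk IDs rather than eleven separate characters, so the number of p's is not directly visible unless the word is spelled out.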

The behavioral asymmetry between the two responses, a confident error followed by a correct answer once the model was forced to enumerate, says something about how these models answer by default. The model has the procedural capacity to count letters correctly when it steps through them one at a time, yet its default behavior bypasses this process in favor of pattern retrieval. This mirrors a broader phenomenon in LLM research: chain-of-thought prompting and forced enumeration often unlock correct reasoning that spontaneous responses fail to produce. The "how?" follow-up essentially compelled the model to engage System 2-style deliberative reasoning rather than System 1-style heuristic recall.

The researcher's methodology — building an automated loop to generate simple one-liners that cause five independent instances of the same model to disagree — represents a principled approach to stress-testing model consistency and reliability. Disagreement among multiple instances of the same model on a deterministic factual question (a word's character composition) underscores that these failures are not isolated edge cases but structural vulnerabilities. The open-source repository documenting such questions serves as a growing catalog of brittleness points, useful both for AI safety evaluation and for benchmarking future model generations. Notably, the finding holds even at "xhigh effort" and 1 million token context — parameters that might reasonably be expected to improve accuracy — suggesting the limitation is architectural rather than a matter of compute allocation.
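A disagreement check of this kind might look roughly like the sketch below. The source does not show the researcher's code, so everything here is an assumption: `ask` stands in for whatever call queries one fresh model instance, and the canned answers in the stub are invented purely so the example runs offline.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List


def collect_answers(ask: Callable[[str], str], question: str,
                    n_instances: int = 5) -> List[str]:
    """Ask the same question n_instances times and normalize the replies."""
    return [ask(question).strip().lower() for _ in range(n_instances)]


def disagreement(answers: List[str]) -> float:
    """Fraction of answers that differ from the most common answer."""
    if not answers:
        return 0.0
    top_count = Counter(answers).most_common(1)[0][1]
    return 1 - top_count / len(answers)


# Stub standing in for independent model instances (hypothetical replies).
_canned = cycle(["3", "3", "1", "3", "1"])


def fake_ask(question: str) -> str:
    return next(_canned)


answers = collect_answers(fake_ask, 'How many p\'s are in "strawperrry"?')
print(Counter(answers))
print(f"disagreement: {disagreement(answers):.0%}")
```

A question would be worth cataloging when this score is above zero, since the correct answer to a letter count does not vary between runs.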

This episode connects to a well-documented and still-unresolved tension in frontier AI development: the gap between apparent language mastery and reliable symbolic manipulation. Models trained on human-generated text absorb robust statistical associations between words and concepts, but the discretized, positional nature of character sequences is poorly captured by token-level representations. While newer architectures and training regimes have made incremental progress — and some models now handle "strawberry" correctly after targeted fine-tuning — the "strawperrry" variant illustrates how narrowly scoped those improvements can be. Novel perturbations to known failure cases continue to expose the same underlying fragility, reinforcing that solving letter-counting for one word does not generalize to solving the character-enumeration problem at large.
