Detailed Analysis
Alibaba's Qwen3.6-35B-A3B outperformed Anthropic's Claude Opus 4.7 on Simon Willison's informal "pelican riding a bicycle" SVG generation benchmark, as reported on April 16, 2026. In Willison's test, the Qwen model, running locally on a MacBook Pro M5 as a 21GB quantized file, produced a more accurate and visually coherent SVG illustration of a pelican on a bicycle, while Claude Opus 4.7 "managed to mess up the bicycle frame." A second prompt, asking for a flamingo riding a unicycle, also favored Qwen, in part because of a charming detail: the model left a comment about sunglasses in its SVG code. The result drew attention across AI communities as an apparent David-versus-Goliath moment, with a locally run open-weight model besting one of Anthropic's flagship commercial offerings on a creative visual task.
Willison himself is the first to put the result in context and to caution against over-reading it. He has long said that the pelican benchmark has "always been a joke": a whimsical, informal test that nonetheless tracked broader model quality in a rough-and-ready way. That historical correlation now appears to be breaking down, which Willison treats not as a meaningful indictment of Claude Opus 4.7 but as evidence that the benchmark has outlived its informal predictive utility. His core message is that a single creative SVG generation task, however entertaining, carries little inferential weight about a model's general-purpose capabilities, particularly when the competing model is a heavily quantized variant running under hardware constraints.
The coding benchmark data Willison presents tells a dramatically different story: Claude Opus 4.7 solved 95 out of 98 tasks on a standard coding evaluation suite, while Qwen3.6-35B-A3B solved only 11 out of 98. This gap — roughly 97% versus 11% task completion — underscores the substantial difference in reliable, structured problem-solving capacity between the two models. For developers and enterprises evaluating models for production use, coding benchmarks of this type carry far more weight than SVG aesthetics, and on those measures, Anthropic's model retains a commanding lead. The pelican result, in this light, is best understood as a reminder that no single benchmark, especially a novelty one, should serve as a proxy for holistic model evaluation.
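The percentages quoted above follow directly from the task counts. As a minimal sketch, the arithmetic can be checked in a few lines of Python; the helper name `completion_rate` is hypothetical and introduced only for illustration, and the only inputs are the counts given in the text.

```python
def completion_rate(solved: int, total: int) -> float:
    """Return the share of tasks solved, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * solved / total


# Task counts as reported in the text (98 tasks in the suite).
TOTAL_TASKS = 98
claude_rate = completion_rate(95, TOTAL_TASKS)
qwen_rate = completion_rate(11, TOTAL_TASKS)

# Difference between the two rates, in percentage points.
gap_points = claude_rate - qwen_rate

print(f"Claude Opus 4.7:  {claude_rate:.1f}%")   # 96.9%
print(f"Qwen3.6-35B-A3B:  {qwen_rate:.1f}%")     # 11.2%
print(f"Gap:              {gap_points:.1f} percentage points")  # 85.7
```

Rounded to whole numbers, these match the "roughly 97% versus 11%" figures in the paragraph above, a difference of about 86 percentage points.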
The episode reflects a broader dynamic in the AI landscape as of early 2026: open-weight and locally runnable models are advancing rapidly in certain narrow domains, occasionally matching or exceeding proprietary frontier models on specific creative or generative tasks, even while lagging considerably on rigorous capability benchmarks. Alibaba's Qwen series has become an increasingly competitive presence in the open-weight space, and results like this — however limited in scope — reinforce the narrative that the gap between open and closed models is narrowing in some dimensions. For Anthropic, the result serves less as a genuine competitive threat and more as an illustration of how easily isolated benchmark wins can be weaponized for headlines, a challenge all frontier AI labs must navigate as the ecosystem grows more crowded and benchmark proliferation accelerates.