Detecting and preventing distillation attacks - Anthropic

Detailed Analysis

Anthropic's February 2026 disclosure of industrial-scale distillation attacks against Claude represents one of the most detailed public accounts of systematic AI model theft to date. Three Chinese AI laboratories were named: DeepSeek, Moonshot (known for its Kimi models), and MiniMax. Together they created approximately 24,000 fraudulent accounts and used them to generate over 16 million exchanges with Claude, harvesting its outputs at scale. The campaigns were not opportunistic but highly targeted, focusing on Claude's most differentiated capabilities: agentic reasoning, tool use, and coding. By feeding these outputs back into their own training pipelines, the labs sought to acquire, at minimal cost, the performance characteristics that Anthropic spent significant resources developing. The activity constitutes a direct violation of Anthropic's terms of service and, in certain cases, applicable regional restrictions on technology transfer.

The detection methodology Anthropic deployed reflects a layered, intelligence-driven approach. Classifiers and behavioral fingerprinting analyze API traffic for patterns that diverge sharply from legitimate usage, particularly the systematic elicitation of chain-of-thought reasoning traces, which is a hallmark of distillation-oriented prompting rather than ordinary product development. IP correlation, request metadata, and infrastructure indicators allowed Anthropic to attribute activity across accounts that used proxy networks to blend illicit traffic with legitimate requests. The MiniMax campaign proved especially instructive: Anthropic detected it mid-training, giving the company a rare opportunity to observe the full operational lifecycle of a distillation attack and calibrate its defenses accordingly. Partner reports also contributed to detection, suggesting that cross-industry intelligence sharing is already a functional component of the defensive ecosystem.
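
To make the idea of combining infrastructure correlation with behavioral signals concrete, here is a minimal Python sketch. It is not Anthropic's detection system, whose internals are not public. Everything in it is an assumption made for illustration: the `Request` record, the keyword list standing in for a trained classifier, the rule that two accounts are linked when they share an IP address, and the threshold values.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical log record; real API logs would carry far more metadata."""
    account_id: str
    ip: str
    prompt: str


# Assumption: a few marker phrases stand in for a trained classifier's signal.
ELICITATION_MARKERS = ("step by step", "show your reasoning", "explain your thinking")


def elicitation_rate(requests):
    """Return, per account, the fraction of prompts containing a marker phrase."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in requests:
        totals[r.account_id] += 1
        if any(marker in r.prompt.lower() for marker in ELICITATION_MARKERS):
            hits[r.account_id] += 1
    return {acct: hits[acct] / totals[acct] for acct in totals}


def cluster_by_shared_ip(requests):
    """Group accounts that share at least one IP, transitively (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    first_account_for_ip = {}
    for r in requests:
        find(r.account_id)  # register the account even if it never links
        if r.ip in first_account_for_ip:
            parent[find(r.account_id)] = find(first_account_for_ip[r.ip])
        else:
            first_account_for_ip[r.ip] = r.account_id

    clusters = defaultdict(set)
    for acct in list(parent):
        clusters[find(acct)].add(acct)
    return list(clusters.values())


def flag_clusters(requests, min_accounts=2, min_rate=0.8):
    """Flag linked account groups whose mean elicitation rate is high.

    Both thresholds are illustrative assumptions, not published values.
    """
    rates = elicitation_rate(requests)
    flagged = []
    for members in cluster_by_shared_ip(requests):
        if len(members) < min_accounts:
            continue
        mean_rate = sum(rates[a] for a in members) / len(members)
        if mean_rate >= min_rate:
            flagged.append(sorted(members))
    return sorted(flagged)
```

Called on a list of `Request` records, `flag_clusters` returns sorted lists of account IDs that are linked through shared IPs and dominated by reasoning-elicitation prompts. A production system would presumably replace the keyword match with learned classifiers and link accounts on many more indicators than IP address alone, since proxy networks are used precisely to defeat simple IP matching.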

Anthropic's countermeasures operate at multiple layers of the product stack. On the prevention side, the company has implemented stronger identity verification requirements for categories of accounts considered high-risk for abuse, including those registered as educational institutions, research organizations, or early-stage startups — precisely the types of entities that would be used to establish plausible cover. At the model and API level, Anthropic has deployed safeguards designed to degrade the quality of outputs specifically useful for distillation — such as altering or suppressing detailed chain-of-thought traces — without materially affecting the experience of legitimate users. Intelligence on technical indicators is being shared with other AI laboratories, cloud infrastructure providers, and relevant authorities, signaling an industry-wide effort to raise the collective cost of these operations.

The broader significance of this disclosure extends well beyond Anthropic's immediate competitive interests. Model distillation is a technically legitimate technique with wide academic and commercial use, but its weaponization to bypass the enormous computational and research costs of frontier model development represents a structural challenge to the economics of AI safety investment. Companies like Anthropic justify frontier development partly on the premise that safety-focused labs should remain at or near the capability frontier; systematic theft of those capabilities by actors without comparable safety commitments undermines that rationale. The attacks also illustrate the limits of contractual and technical controls alone: Anthropic itself acknowledges that users selling real interaction data, or adversaries adapting their prompting strategies to evade classifiers, remain persistent challenges.

The incident sits at the intersection of AI competitiveness, national security, and IP enforcement, and it is likely to accelerate regulatory and legislative attention in all three domains. Anthropic's explicit reference to export controls on semiconductor hardware as a complementary deterrent signals that the company views this as a problem requiring policy solutions alongside technical ones. The disclosure also sets a notable precedent for transparency: by naming specific organizations, describing attack methodologies in technical detail, and publishing detection and prevention frameworks, Anthropic is effectively establishing a public record that other frontier labs can benchmark against. Whether competitors adopt similar disclosure practices — and whether regulators treat distillation attacks as a category warranting formal legal remedies — will shape how the industry responds to this class of threat going forward.
