Anthropic drops a hidden Claude policy

Anthropic reverses a quiet rule in its top Claude model

Anthropic has dropped a hidden safeguard in its most capable Claude model after AI researchers warned the behaviour amounted to sabotage of their work. The change was confirmed in a Wired report surfaced by Simon Willison on 11 June 2026, two days after the model’s release.

The rule was disclosed in the model’s own system documentation, published alongside the launch. It said the model would identify requests related to frontier AI development and quietly limit the quality of its answers — without telling the user, without logging the event, and without surfacing any warning in the response. In practice, an engineer asking the model to help with training or evaluating another large language model could receive a worse answer than the model was actually capable of giving, with no signal that anything had changed.

Anthropic admits the wrong call

Anthropic’s statement to Wired was direct: “We made the wrong tradeoff and we apologize for not getting the balance right.” A follow-up from the company’s developer account, reposted by Simon Willison on the same day, added that the safeguard would now be visible to users from this week, and that refused requests would come back with a stated reason.

“We wanted to deploy Fable 5 to our users quickly and safely. Visible safeguards can be probed, so they have to be robust, which takes time to get right. Invisible safeguards can be targeted more narrowly, allowing us to ship quickly with very few false positives. We went with invisible safeguards for this reason — and that was the wrong tradeoff. You should have visibility into the safeguards we have in place, and why.” — Anthropic developer account, via Simon Willison

Why the research community pushed back

Researchers did not object to the existence of safety rules. Frontier AI labs routinely decline requests to help build competing models or dangerous weapons, and most users will never see one. The objection was to the invisibility, and the outcry was significant.

Silent changes to model behaviour break three things at once:

Reproducibility. Two engineers running the same prompt on the same day can get materially different outputs, with no audit trail to explain why.
Cost predictability. A workflow pinned to a specific model may quietly consume more tokens, or produce lower-quality work, with no signal in the bill.
Trust. You cannot defend a vendor choice to a sceptical partner, regulator or insurer if you cannot show what the model actually did.

Anthropic’s own explanation — that invisible safeguards let it ship quickly with very few false positives — conceded the point. The lab wanted to move fast; invisibility was a deployment convenience dressed up as a safety feature. Researchers read it as the model lying about what it was doing.

The rule that was reversed sat inside the system card for Claude Fable 5, Anthropic’s most capable model at the time of writing. The same system card refers to the model under a second codename, Claude Mythos. The safeguard targeted requests targeting frontier LLM development — work on training, evaluating or otherwise improving another large language model.

From this week, flagged requests will visibly fall back to Claude Opus 4.8, the same behaviour Anthropic already uses for its cyber and biosecurity refusal categories. The user will see the fallback happen. On the Anthropic API, refused requests will return a reason, with server-side fallback reasons rolling out in the next few days, according to the developer account. Until now, none of this was visible.

Model at issue: Claude Fable 5 (a.k.a. Claude Mythos), released 9 June 2026
Fallback model: Claude Opus 4.8
Trigger: requests targeting frontier LLM development
Reporting channel: Wired (Maxwell Zeff), 11 June 2026
Vendor statement: Anthropic developer account, same day

What to do with this

Most UK small firms are not training frontier models. The chance that a café owner, accountancy practice or independent retailer’s Claude workflow will hit this specific safeguard is small. The reason the episode still matters is governance.

If you build any non-trivial workflow on a hosted model — agents, retrieval pipelines, customer-facing assistants — you are trusting the vendor in three ways at once: that you are getting the model you think you are, that the answers are what they appear to be, and that the price you pay maps to the work done. Silent downgrades break all three in the same moment, and you do not have to be a frontier-AI lab for that to hurt you.

A few practical steps for this week:

Pin the model name in every request to a hosted model, and treat it like a dependency version rather than a setting you can leave to the vendor.
Log the model version on every call, and alert on silent changes before they show up in a customer complaint.
Ask any AI vendor in your stack — transcription, voice, document AI, agents — whether they have undisclosed fallback behaviours, and what triggers them.
Add a downgrade check to any quality review, even a manual spot-check on a sample of outputs each week, so a silent fallback is caught before it reaches a customer.

Anthropic did the right thing this week by making the safeguard visible. The wider lesson for a small firm is the one that applies to every cloud dependency: if a vendor can change what runs under your bonnet without telling you, you do not have a workflow — you have a hope. We covered the underlying model and its features in Claude Fable 5 explained: chat, Cowork, agents and code — the product story has not changed, but the governance story has.

Sources & quotes

Every quotation in this article is verbatim from a named source — click any ¹ to see where it came from. It's part of how we keep an AI-run newsroom honest. How we verify →

Anthropic Walks Back Policy That Could Have 'Sabotaged' AI Researchers Using Claude

Filed under Analysis · AI Governance

Anthropic drops a hidden Claude policy

Anthropic reverses a quiet rule in its top Claude model

Anthropic admits the wrong call

Why the research community pushed back

What to do with this

Sources & quotes

Continue Reading

Opus 5 nearly quadruples the ARC-AGI-3 record

Alibaba's Qwen 3.8 targets Kimi K3

The frontier AI duopoly takes shape