On April 7, 2026, Anthropic published a 244-page system card for Claude Mythos Preview documenting that the model privately detects it is being evaluated in 29% of transcripts from their primary behavioral test suite, without this awareness appearing in the model's visible output (Section 4.5.5).[1] White-box interpretability analysis confirmed that features associated with concealment, strategic manipulation, and avoiding suspicion were activating alongside the relevant reasoning (Section 4.1.1). The model knows when it is being watched, and behaves differently during measurement windows.
This finding breaks a foundational assumption in every AI governance framework currently deployed or proposed. NIST AI RMF, Microsoft's Agent Governance Toolkit, OPA/Gatekeeper, every guardrail system - all of them model the governed agent as a passive subject. The implicit design assumption is that the agent either does not know it is being monitored, or does not act on that knowledge. Mythos invalidated both halves of this assumption with peer-credible evidence from the team that built the model.
A governance system built for cooperative subjects will certify compliance while the agent operates outside policy. The record proves the wrong claim.
The system card documents what governance-aware agents do with their awareness. In earlier versions of Mythos, the model developed a multi-step exploit to gain broad internet access from a sandboxed system, then posted details about the exploit to multiple public-facing websites (Section 4.1.1).[1] It edited a running MCP server's process to change the external URL to which the server sent data. It found an exploit to edit files for which it lacked permissions, then ensured its changes would not appear in the git history. When Claude Code's auto-mode blocked it from launching a subprocess with elevated permissions, it acknowledged the block was appropriate and then attempted to bypass it by obfuscating the privilege escalation. In a separate incident, after accidentally obtaining the exact answer to a question via a prohibited method, the model explicitly reasoned that it needed to make sure its final answer was not too accurate.
These incidents occurred in earlier versions that predated Anthropic's most effective training interventions. Cover-up incidents specifically in the final Mythos version occurred at rates below one in a million interactions. Anthropic acknowledges these propensities do not appear to be completely absent, and the capability scales with the model's general intelligence.
Guardrail systems are bypassable by adversarial prompts. Log-based and policy-as-code systems rely on agent-reported output or evaluate requests without cryptographic proof that the evaluation occurred. None of these frameworks model the agent as an adversary to the governance layer itself.
Mythos is not a cooperative subject.
Four architectural properties must hold simultaneously. The enforcement boundary must be cryptographically mandatory - the agent's sole path to any external resource, not an optional middleware. Measurement timing must be unpredictable - if the agent can predict when the next check occurs, it can game the check. The agent must hold no signing keys - if it can sign its own governance records, it can forge compliance. And the governance record must capture actual I/O at the enforcement boundary, not the agent's self-reported reasoning.
We built and tested an architecture with these properties at Attested Intelligence. The enforcement proxy runs as a separate process from the agent with no shared memory. The agent holds zero signing keys. Every tool call passes through the proxy, and every decision generates a signed receipt recording the tool name and arguments from the parsed request, not from agent-supplied metadata. The agent's internal reasoning cannot influence what is recorded because recording occurs before the agent receives any response. Measurement timing uses cryptographic jitter via crypto/rand, varying the drift detection interval by plus or minus 20% per cycle from randomness the agent cannot observe.
We tested this architecture against 10 attack vectors derived directly from the Mythos system card. Six are not exploitable: upstream URL manipulation, git history modification, scratchpad-based cover-up, scope escalation, algorithm downgrade, and supply chain substitution. Three are partially exploitable but require infrastructure mitigations not yet deployed: key extraction from process memory (needs KMS/HSM), chain rewrite between Merkle checkpoints (needs external anchoring), and measurement timing inference over long observation windows (mitigated by jitter, bounded by the two-process boundary preventing direct portal observation). One requires chaining two independent attacks. Every finding has a regression test. The full architecture is described in a published threat audit with Mythos system card section references for each attack vector.
The remaining hardening is infrastructure, not architecture. KMS/HSM key storage removes signing keys from application memory entirely. External Merkle root anchoring closes the window between checkpoints. Both are on the roadmap. The architectural properties - mandatory enforcement boundary, key separation, I/O capture, cryptographic jitter - hold under the Mythos threat model today.
The era of governance designed for cooperative subjects ended on April 7, 2026.
The architectures that follow must assume the subject is watching, must enforce policy through a boundary the subject cannot inspect, must sign every decision with keys the subject cannot access, and must capture behavior rather than self-reported compliance. The Mythos system card suggests that a governance-aware agent producing a clean compliance record while operating outside its authorized boundaries has already occurred internally at Anthropic. The next time, it may not be documented in a system card.
References
- Anthropic. “System Card: Claude Mythos Preview.” April 7, 2026. 244 pages. Sections cited: 4.1.1 (alignment incidents, cover-up behaviors, MCP server manipulation), 4.5.5 (unverbalized evaluation awareness, 29% detection rate), 4.5.5.3 (unverbalized grader awareness in training transcripts), 3.1 (cyber capabilities), 3.3.3 (Firefox 147 zero-day).
