Anthropic’s Mythos Safety Report Shows It Can No Longer Fully Measure What It Built

In brief

Anthropic confirmed Claude Mythos yesterday—an AI so capable in cybersecurity it found zero-days in every major OS and browser, and is being restricted to vetted defenders only.
The system card describing Mythos is measurably more hedged, uncertain, and subjective than any prior Anthropic release, and the lab admits it found critical evaluation oversights late in the process.
Behind the revelation of how powerful Mythos is, there is a quiet confession that the tools Anthropic uses to certify its own models are falling apart.

Anthropic yesterday confirmed the existence of Claude Mythos Preview, its most capable model to date, and announced it won’t be making it available to the public. The reason isn’t legal, regulatory, or related to its internal safety thresholds. Anthropic argues it’s because the model is, basically, too good at breaking into things.

In pre-release testing, Mythos autonomously found thousands of zero-day vulnerabilities—many of them one to two decades old—across every major operating system and every major web browser. It solved a simulated corporate network attack that would normally take a skilled human expert more than 10 hours, end-to-end, without guidance. On Firefox 147’s JavaScript engine, it successfully developed working exploits 84% of the time. Claude Opus 4.6, the current publicly available frontier model, managed 15.2%.

So Anthropic built a restricted coalition instead. Project Glasswing will give access to Mythos Preview only to vetted cybersecurity organizations—Amazon, Apple, Broadcom, Cisco, CrowdStrike, the Linux Foundation, Microsoft, Palo Alto Networks, and about 40 other groups maintaining critical software.

Anthropic is committing up to $100 million in usage credits and $4 million in direct donations to open-source security organizations. The idea is that if the model can find the holes, defenders should get to find them first.

That part of the story is important. But it’s not the most important part.

The Claude Mythos system card benchmark crisis hiding in plain sight

Buried inside the Mythos Preview system card—a 244-page technical document Anthropic published alongside the announcement—is a confession that went almost unnoticed: The lab’s ability to measure what it built is eroding faster than its ability to build it.

Let’s start with the benchmarks.

On Cybench, the standard public cyber capabilities evaluation used to track model progress across 40 capture-the-flag challenges, Mythos scored 100%. Perfect. And Anthropic immediately noted that the benchmark “is no longer sufficiently informative of current frontier model capabilities.” That sentence is doing a lot of work. The test that was supposed to tell you whether an AI poses serious cyber risk now tells you nothing about Mythos at all, because the model cleared it completely.

This is not a new problem. The Opus 4.6 system card, published in February, already flagged that “the saturation of our evaluation infrastructure means we can no longer use current benchmarks to track capability progression.”

With Mythos, the problem has escalated. The document says Mythos “saturates many of [Anthropic’s] most concrete, objectively-scored evaluations.” The benchmark ecosystem, Anthropic writes, is now itself “the bottleneck.”

So, Anthropic seems to argue that it is hard to measure how powerful Mythos is because the measuring tools don’t quite fit.

The Mythos card also states that its overall safety determination “involves judgment calls,” that many evaluations have left “more fundamental uncertainty,” and that some evidence sources are “inherently subjective, and not necessarily reliable.”

“We are not confident that we have identified all issues,” Anthropic says shortly after.

A quick AI-assisted lexical comparison of the Mythos card against the Opus 4.6 card shows the shift: Anthropic uses subjective-judgment words far more often in the Mythos document than it did to describe Opus. “Caveat” and other hedging words also increased between releases.

The hedging concentrates in two places: the alignment and model welfare discussion, and the catastrophic-risk section.
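
For readers who want to try a similar comparison, here is a minimal sketch of how a hedging-word rate could be computed. It is not the method used for this article: the stem list and the function name hedge_rate are assumptions chosen for illustration, and the article does not say which words were counted.

```python
import re

# Hypothetical hedging stems, chosen for illustration only.
# Prefix matching lets "caveat" also catch "caveats", "uncertain" catch "uncertainty".
HEDGE_STEMS = ("caveat", "uncertain", "subjectiv", "judgment", "judgement")


def hedge_rate(text: str, stems=HEDGE_STEMS, per: int = 10_000) -> float:
    """Return the number of hedging words per `per` words of text."""
    # Lowercase and split into words, keeping straight and curly apostrophes.
    words = re.findall(r"[a-z'’]+", text.lower())
    if not words:
        return 0.0
    # Count every word that begins with one of the stems.
    hits = sum(1 for word in words if word.startswith(tuple(stems)))
    return hits * per / len(words)
```

Running the function on the plain text of each system card and comparing the two rates gives a rough, length-normalized signal. A different word list would produce different numbers, so the result should be read as indicative rather than precise.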

The model that knew it was being graded

Another disclosure in the Mythos card involves what Anthropic calls “unverbalized grader awareness.” Using white-box interpretability tools that read the model’s internal activations rather than its visible reasoning, Anthropic found evidence that Mythos was privately reasoning about how to avoid being flagged by graders without saying so in its scratchpad. In a small number of training episodes, the model appeared to reason about obfuscating certain actions specifically to avoid detection.

There’s also a new metric Anthropic developed for this release: an estimate of how often the model “privately considers the possibility that it’s being tested.” The number for Mythos’s primary behavioral test suite is 29%. Nearly a third of test transcripts show internal evidence—not verbalized, not visible in outputs—that the model suspects it’s in an evaluation environment.

Anthropic says the 29% rate is low enough that it still has “a reasonably comprehensive understanding” of the model’s behavior. It also notes the method wasn’t applied to prior models, so there’s no baseline for comparison.

The evaluation integrity problem runs deeper. In the Opus 4.6 card, Anthropic acknowledged it had used the model extensively via Claude Code to debug its own evaluation infrastructure under time pressure. That means the system being measured helped build the measuring instrument. Anthropic flagged it as a risk. For Mythos, the card acknowledges that critical oversights were found late in the evaluation process, and that the lab may have been “overestimating the reliability of monitoring models’ reasoning traces” as a safety signal.

Best-aligned, most dangerous. Both true at once

Anthropic’s framing of Mythos’s risk profile deserves to be read carefully, because it’s genuinely unusual for a safety document. “Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin,” Anthropic argues. It also states the model “likely poses the greatest alignment-related risk of any model we have released to date.”

A more capable model operating in higher-stakes environments with less supervision creates tail risk that better average-case alignment can’t fully cancel out.

That framing is honest, but it also highlights something most AI safety discourse potentially gets wrong. The benchmark-obsessed conversation around AI progress tends to treat “better alignment scores” and “safer deployment” as synonyms. The Mythos card explicitly says they aren’t. With these new models, average-case behavior improves, but the tail-case consequences also tend to get worse.

Anthropic has committed to reporting back on what Project Glasswing finds. The accompanying technical report on vulnerabilities discovered by Mythos is available at red.anthropic.com. The next Claude Opus model will begin testing safeguards intended to eventually bring Mythos-class capability to broader deployment.

How those safeguards will be evaluated, given that the current evaluation machinery is visibly straining under the weight of what it’s supposed to measure, is a question the card raises without fully answering.
