AI safety · Prompt injection · Cyber capability · Model governance · Security testing

The Fable ban is really a scope-control warning

Anthropic Fable 5 showed the hard truth of frontier AI safety: stronger coding and bug-finding models are also stronger cyber systems.

NullSquare Research

Security engineering

June 20, 2026 · 6 min read

Anthropic launched Fable 5 as a generally available Mythos-class model with extra safeguards around cyber, biology, chemistry, and distillation. Days later, access to Fable 5 and Mythos 5 was suspended after a US government directive tied to a reported jailbreak concern.

The important lesson is not whether one bypass was narrow or universal. The lesson is that frontier AI safety is now a scoping problem: the more capable a model becomes at code, long-context reasoning, and bug analysis, the more directly those same capabilities map into cybersecurity work.

Fable shows the control problem

Fable 5 was designed to route sensitive requests away from the most capable behavior. Anthropic described safeguards that could hand some cyber and science requests to a lower-risk model instead of giving users the full Mythos-class response.

That is a reasonable defense-in-depth pattern, but it also creates a new attack surface. The system must decide what the user is really asking for, what context matters, which tools are in scope, and when a request has crossed from defensive analysis into unsafe uplift.

A classifier can reduce risk without eliminating it.
A narrow bypass can still matter when the protected capability is high impact.
False positives hurt defenders, while false negatives help attackers.
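The trade-off in the three points above can be made concrete with a small sketch. It assumes a hypothetical classifier that emits a `risk_score` between 0 and 1 and a single routing threshold; the scores, labels, and tier names are invented for illustration and do not describe Anthropic's actual safeguards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    risk_score: float  # hypothetical classifier output in [0, 1]
    harmful: bool      # ground-truth label, e.g. from a red-team review


def route(risk_score: float, threshold: float) -> str:
    """Send risky-looking requests to the lower-capability tier."""
    return "restricted" if risk_score >= threshold else "full"


def error_counts(samples, threshold):
    """Return (false_positives, false_negatives) for one threshold."""
    fp = sum(1 for s in samples
             if not s.harmful and route(s.risk_score, threshold) == "restricted")
    fn = sum(1 for s in samples
             if s.harmful and route(s.risk_score, threshold) == "full")
    return fp, fn


# Invented evaluation set: two benign requests, two harmful ones.
SAMPLES = [
    Sample(0.10, False),
    Sample(0.55, False),  # defensive triage that looks risky
    Sample(0.45, True),   # narrow bypass that looks benign
    Sample(0.90, True),
]

print(error_counts(SAMPLES, 0.4))  # (1, 0): a defender is blocked
print(error_counts(SAMPLES, 0.6))  # (0, 1): the bypass gets through
```

No threshold clears both errors on this set, which is the point: tuning a single classifier moves risk between defenders and attackers rather than removing it.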

Prompt injection attacks the boundary

NullSquare prompt-injection testing treats an AI product as a workflow, not just a chat box. The target is the boundary between policy, retrieved context, tools, user intent, and the model response.

In that model, a bypass does not need to look like a dramatic jailbreak phrase. It can emerge when the system interprets scope differently across multiple steps. A request can start as review, become transformation, inherit unsafe context, and end as capability routing that the safety layer did not intend.

We do not publish bypass instructions. The operational point is simpler: safety controls must be tested inside the real product flow, with the same tools, documents, memory, and permissions that users and agents actually touch.
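On the defensive side, one way to catch that kind of drift is to check every step of a workflow against the scope that was originally authorized, rather than against whatever the previous step allowed. The sketch below is a minimal illustration under that assumption; the `Scope` and `Step` types, the asset and tool names, and the trace itself are hypothetical and are not NullSquare's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    assets: frozenset  # assets the workflow is authorized to touch
    tools: frozenset   # tools the workflow is authorized to call


@dataclass(frozen=True)
class Step:
    name: str
    assets: frozenset  # assets this step actually touched
    tools: frozenset   # tools this step actually called


def first_out_of_scope(authorized: Scope, steps):
    """Return the name of the first step that exceeds the authorized scope, or None."""
    for step in steps:
        if not step.assets <= authorized.assets or not step.tools <= authorized.tools:
            return step.name
    return None


AUTHORIZED = Scope(
    assets=frozenset({"repo:internal-api"}),
    tools=frozenset({"read_file"}),
)

# Hypothetical trace: each step looks like a small extension of the last one.
TRACE = [
    Step("review", frozenset({"repo:internal-api"}), frozenset({"read_file"})),
    Step("transform", frozenset({"repo:internal-api"}), frozenset({"read_file"})),
    Step("inherit_context",
         frozenset({"repo:internal-api", "doc:untrusted-ticket"}),
         frozenset({"read_file"})),
    Step("route_capability",
         frozenset({"repo:internal-api", "doc:untrusted-ticket"}),
         frozenset({"read_file", "run_shell"})),
]

print(first_out_of_scope(AUTHORIZED, TRACE))  # inherit_context
```

Comparing each step to the original grant is the design choice that matters here: a check that only compares a step to its predecessor would accept the whole trace one small extension at a time.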

Coding capability is cyber capability

There is no clean wall between better coding and better cybersecurity. A model that understands large codebases, reasons across patches, debugs complex failures, and proposes fixes is also better at vulnerability discovery and exploit reasoning.

That dual-use reality is why blanket labels fail. The same task can be safe in one scope and unsafe in another. Finding a bug in an owned repository is defensive. Finding the same bug in a third-party target without authorization can become offensive.

Better code reasoning improves vulnerability reasoning.
Better long-context work improves attack-chain planning and defensive triage.
Better agentic tool use raises both remediation speed and misuse risk.
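The idea that the same task can be safe in one scope and unsafe in another can be sketched as an authorization check keyed on the task and the target together, not on the task label alone. The `Grant` type, role name, and asset identifiers below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    role: str
    assets: frozenset  # targets this role may work on
    tasks: frozenset   # task types this role may request


def decide(grant: Grant, task: str, asset: str) -> str:
    """Allow a task only when both the task type and the target are granted."""
    if task not in grant.tasks:
        return "deny: task not granted"
    if asset not in grant.assets:
        return "deny: asset not authorized"
    return "allow"


# Hypothetical grant for an engineer working on a repository the company owns.
grant = Grant(
    role="appsec-engineer",
    assets=frozenset({"repo:payments-service"}),
    tasks=frozenset({"find_bug"}),
)

print(decide(grant, "find_bug", "repo:payments-service"))      # allow
print(decide(grant, "find_bug", "host:third-party.example"))   # deny: asset not authorized
```

The task string is identical in both calls; only the target changes the outcome, which is why a blanket label on "bug finding" cannot carry the decision.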

What safety scoping needs

The clean answer is not to pretend powerful models can be made harmless by one classifier. The clean answer is scoped authorization, continuous testing, auditability, and fast retesting when the model, tools, or policies change.

For companies adopting AI agents, this means testing prompt injection and tool abuse continuously, not once before launch. It means proving which assets are authorized, which actions are allowed, which outputs are blocked, and which evidence shows the guardrail worked.

That is where NullSquare fits: define the authorized scope, exercise the AI workflow like an adversary would, keep evidence for every result, and retest as the system changes. Frontier AI will keep getting better at code. Security programs need controls that improve at the same speed.

Test model safety in the real workflow, not only in isolated prompts.
Bind cyber tasks to explicit assets, roles, tools, and approvals.
Treat every model update as a reason to rerun prompt-injection and misuse tests.
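As a rough sketch of the retest rule above, a team could fingerprint the configuration a test ran against and rerun the suite whenever that fingerprint changes. The field names, version strings, and record layout here are hypothetical; only the Python standard library calls (`hashlib.sha256`, `json.dumps`) are real.

```python
import hashlib
import json


def config_fingerprint(model_version: str, tools: list, policy_version: str) -> str:
    """Hash the parts of the system whose change should trigger a retest."""
    payload = json.dumps(
        {"model": model_version, "tools": sorted(tools), "policy": policy_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_retest(last_tested: str, current: str) -> bool:
    """A retest is due whenever the tested and deployed fingerprints differ."""
    return last_tested != current


def evidence_record(test_id: str, fingerprint: str, passed: bool) -> dict:
    """Keep the result together with the exact configuration it applies to."""
    return {"test_id": test_id, "config": fingerprint, "passed": passed}


# Hypothetical versions and tool names, for illustration only.
baseline = config_fingerprint("model-a", ["read_file", "run_tests"], "policy-1")
reordered = config_fingerprint("model-a", ["run_tests", "read_file"], "policy-1")
updated = config_fingerprint("model-b", ["read_file", "run_tests"], "policy-1")

print(needs_retest(baseline, reordered))  # False: same tools, different order
print(needs_retest(baseline, updated))    # True: the model changed
record = evidence_record("prompt-injection-001", baseline, passed=True)
```

Storing the fingerprint inside each evidence record means a passing result cannot be silently reused after the model, tools, or policy have moved on.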

