AI Resiliency

Fable 5 Didn’t Go Down. It Got Switched Off.

Twice a year I help run disaster recovery game days for our AWS infrastructure. We pick a quiet afternoon, page the on-call, and break something on purpose. Kill an availability zone. Fail the primary database over to the replica. Cut a region out entirely and watch which services notice and which ones quietly lie about being healthy. You learn a lot about your architecture when you're the one holding the knife.

In years of doing this, here is a scenario I have never once drilled: the dependency is completely healthy, and we are not allowed to use it.

That happened in the real world on June 12.

On June 9, Anthropic shipped Claude Fable 5, its most capable publicly released model, built for long-horizon agentic work and live the same day across the API, AWS, and Microsoft Foundry. Seventy-two hours later it was gone. Not slow, not rate-limited, gone. A US government export-control directive on June 12 prohibited use of Fable 5 and its sibling Mythos 5 by anyone who isn't a US national. Anthropic can't check your passport in real time across every contract, employee, and cloud delivery path, so it did the only thing available to it. It suspended the models for everyone, everywhere, with no restoration date.

Sit with that if you build software for a living. The status page never went yellow. Your most capable dependency was switched off by someone who doesn't work at your vendor and never signed your SLA.

We drilled for the wrong failure

Everything in my DR runbook assumes things break. A region goes down, a node falls over, latency spikes past the threshold and the alarms go off. Every one of those is a dependency that's slow or sick. You retry, you fail over, you wait it out, because the thing is coming back. The whole discipline rests on a quiet promise: the dependency wants to be available and is just temporarily unable.

Fable 5 wasn't sick. It was switched off. There is no retry that fixes "a government said no." There is no exponential backoff for an export directive. The model isn't behind a degraded load balancer; it's behind a legal wall with no published door. I can evacuate a region in an afternoon. I cannot appeal a national-security directive from my laptop.

That is a failure class that was never in the game day, and most production architectures don't model it.
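
One way to make that distinction concrete in code is to classify a failure before retrying it. The sketch below is a minimal illustration, not any vendor's SDK: `ModelPolicyUnavailable`, `classify`, and `call_with_retry` are hypothetical names, and how you would actually detect a policy-level suspension (an error code, a vendor notice, a config flag) is an assumption you would have to fill in for your own provider.

```python
import time
from enum import Enum, auto


class FailureClass(Enum):
    TRANSIENT = auto()        # slow or sick: retrying may help
    POLICY_DISABLED = auto()  # healthy but off-limits: retrying cannot help


class ModelPolicyUnavailable(Exception):
    """Hypothetical error: the model was withdrawn for non-technical reasons."""


def classify(exc: Exception) -> FailureClass:
    # Assumption: your client wrapper maps a suspension notice to this exception.
    if isinstance(exc, ModelPolicyUnavailable):
        return FailureClass.POLICY_DISABLED
    return FailureClass.TRANSIENT


def call_with_retry(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry transient failures with exponential backoff; surface policy failures at once."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if classify(exc) is FailureClass.POLICY_DISABLED:
                raise  # no backoff schedule fixes this
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted
            sleep(base_delay * 2 ** attempt)
```

The policy branch exits on the first attempt, so the caller can move to a fallback instead of spending its retry budget on something that is not coming back.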

Why this should bother regulated shops in particular

I work in fintech, which means the pitch in every room right now is some version of "let's put the best available model in the workflow." Tax document processing, check-image classification, agentic back-office work. The better the model, the more headcount and latency you take out, so the incentive is to standardize on the ceiling.

Fable 5 just showed that the ceiling can be nationality-gated in three days. And the failure modes piling up around frontier models are not the ones in my runbook:

Jurisdiction. Model access is now an export-control question, the way it already is for advanced chips. A commercial model, launched and pulled in 72 hours, on national-security grounds. That belongs in the vendor risk register now, not in a hypothetical.
Data retention. Fable shipped with mandatory retention requirements that complicated its rollout with partners like Microsoft. In a regulated context, "where does the prompt go, and for how long" can disqualify a model no matter how good it is.
Provenance. Plenty of products were built and shipped on Fable during its short window. When the model vanished, so did everything standing on top of it.

None of these are outages. None of them get fixed by reliability engineering. They are governance failures wearing an availability costume, which is exactly why they slip past teams who are good at availability.

What I'm taking back to the next game day

Treat your top model like a dependency that can be disabled, not just slowed:

Abstract the model, not just the call. If your code knows that Fable runs always-on adaptive thinking and has its own Messages API shape, you've welded yourself to one model's quirks. Route through an internal canonical interface so swapping the model underneath is a config change, not a rewrite. Same lesson as not hardcoding a single AZ, one layer up the stack.
Keep a real fallback chain, not a fantasy one. Fable to Opus 4.8 to Sonnet 4.6, each tier tested under load before you need it, not the morning you need it. Anthropic's own guidance when the music stopped was to fall back to Opus. The teams that had that wired ate a quality dip. The teams that hardcoded Fable ate an incident. I have watched the difference between a tested failover and an aspirational one play out at 2 a.m., and it is not subtle.
Build to a capability floor, not a ceiling. Design the workflow so your worst acceptable model still clears the bar, and treat anything above that as upside you're allowed to lose. If the system only works with the best model on earth, you don't have a system. You have a bet on one vendor's one SKU staying legal.
Put "model unavailable for non-technical reasons" in the continuity plan. Not the DR plan for the data center. The plan for the day the model is perfectly healthy and you still can't touch it. That is a tabletop exercise I'm finally going to run.
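
The first three items can be sketched together. This is a minimal illustration under stated assumptions, not a production router: `Tier`, `ModelRouter`, `ModelDisabled`, and the `quality` scores are hypothetical, the adapters are stand-in functions rather than real vendor calls, and the scores would come from your own evaluation suite.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


class ModelDisabled(Exception):
    """Hypothetical error: the tier is switched off by policy, not by an outage."""


@dataclass
class Tier:
    name: str                       # internal label, not a vendor model ID
    quality: int                    # hypothetical score from your own eval suite
    complete: Callable[[str], str]  # adapter that hides the vendor's request shape


class ModelRouter:
    """Canonical interface: callers pass a prompt and never name a model."""

    def __init__(self, chain: List[Tier], quality_floor: int,
                 disabled: Optional[Iterable[str]] = None):
        # Capability floor: refuse to start if any tier would miss the bar.
        below = [t.name for t in chain if t.quality < quality_floor]
        if below:
            raise ValueError(f"tiers below the capability floor: {below}")
        self.chain = list(chain)
        self.disabled = set(disabled or ())  # e.g. loaded from config

    def complete(self, prompt: str) -> Tuple[str, str]:
        # Walk the chain in order, skipping tiers we are not allowed to use.
        for tier in self.chain:
            if tier.name in self.disabled:
                continue
            try:
                return tier.name, tier.complete(prompt)
            except ModelDisabled:
                self.disabled.add(tier.name)  # remember it, then fall through
        raise RuntimeError("no permitted model tier is available")


# Stand-in adapters; real ones would wrap each vendor's client.
def top_adapter(prompt: str) -> str:
    raise ModelDisabled("suspended by directive")


def mid_adapter(prompt: str) -> str:
    return "mid: " + prompt


def floor_adapter(prompt: str) -> str:
    return "floor: " + prompt


router = ModelRouter(
    chain=[Tier("top", 90, top_adapter),
           Tier("mid", 80, mid_adapter),
           Tier("floor", 70, floor_adapter)],
    quality_floor=70,
)
used, answer = router.complete("classify this document")
print(used, answer)
```

Here the top tier refuses and the request lands on the middle tier without the caller changing anything. Adding a tier name to `disabled` from config is also a cheap way to run the drill: switch off the top tier on purpose and confirm the workflow still clears the bar.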

The cloud taught us that machines fail, so we engineered for machines failing. We got good at it. Fable 5 taught us that permission can fail, instantly, globally, with no ETA and no replica to fail over to. That is the variable to design around now.

Opus 4.8 is excellent and it's still here. Build on the model that's still here, and assume the one you love most can be taken away by someone you've never met.