
November 6, 2025

7 mins

Lessons from Anthropic’s retrospective.

Quality is the new SLO for SREs to watch out for.

Written by JJ Tang

Anthropic’s latest incident shows the industry that a new category of challenges is coming for SREs. As systems become more sophisticated, failure becomes harder to detect and even harder to define.

Because, if you think about it: did Anthropic have a general outage? Were users experiencing high latency? Was there a flaky third-party dependency?

No, no, and no.

The issue was that Claude, Anthropic’s flagship model family, was “dumber”.

But that’s not something the monitoring tools on the market help you detect.

Reliability in the age of complex AI infrastructure.

Around August, the Anthropic team started noticing more and more complaints on Reddit, X, and other social media platforms. Conspiracy theories emerged. Frustrated users accused the company of deliberately degrading its models’ performance in order to charge higher rates for future releases.

The Anthropic team was perplexed. They would never purposefully degrade a model’s performance; too much work goes into each release. Something had to be going on, even though all the dashboards showed green.

But how do you assess that a response is “dumber” than before? Perhaps the user is not providing the right context, or the task at hand is not equivalent to a previous one and needs prompt refinement.

And then there’s the matter of scope. Let’s assume there is an issue you haven’t detected and can’t reproduce. How do you find out whether the reported degradation is general, or whether it only impacts a specific cluster of users or use cases?

To make matters more difficult, your privacy policies do not allow you to see the content of the queries or the answers that you give to your users. You’re flying blind.

The anatomy of Anthropic’s incident.

When Anthropic published its retrospective, I immediately went through it.

It turns out that Anthropic’s Claude ran in a degraded state for about a month. You may ask yourself: how is that possible? Claude holds a 32% market share in enterprise adoption, beating OpenAI, which is used in only 25% of enterprise deployments.

Todd Underwood, Anthropic’s Head of Reliability, wrote a comprehensive incident postmortem of what had happened and why. That’s not an easy thing to do, considering that to explain the situation you have to lay out the basics of how LLMs are developed, tested, and deployed for an audience that may not be familiar with the subject.

Between August and early September, Anthropic traced three overlapping issues that degraded model performance for a subset of users:

  1. Context-window routing error: short-context requests were mis-routed to long-context servers.
  2. XLA:TPU output corruption: a misconfiguration introduced token generation errors at the compiler level.
  3. Approximate top-k mis-compilation: a compiler optimization caused the most probable token to sometimes be dropped.

The problems weren’t outages or capacity failures; they were quiet degradations that together impacted user experience yet persisted undetected in production.
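To make the third issue concrete: one way to catch that class of bug is a property test asserting that an optimized top-k path never drops the single most probable token. The sketch below is hypothetical, not Anthropic’s code; approx_top_k stands in for whatever fast kernel a serving stack might use.

```python
# Hypothetical property test (not Anthropic's code): an optimized top-k path
# should never drop the single most probable token. approx_top_k stands in
# for whatever fast kernel a serving stack might use.
import numpy as np

def check_top_token_preserved(approx_top_k, vocab=4096, k=40, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        logits = rng.standard_normal(vocab).astype(np.float32)
        top1 = int(np.argmax(logits))                         # most probable token
        candidates = {int(i) for i in approx_top_k(logits, k)}
        if top1 not in candidates:
            return False                                      # the bug class described above
    return True

# Sanity check: an exact top-k implementation trivially passes.
exact_top_k = lambda logits, k: np.argsort(logits)[-k:]
assert check_top_token_preserved(exact_top_k)
```

A check like this doesn’t prove a kernel is correct, but it turns “the model feels dumber” into a falsifiable assertion you can run on every build.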

At Rootly, we see this pattern often across prospective customers and people we talk to in the community: large, distributed systems where everything looks healthy until users start noticing something feels off. Anthropic’s retrospective shows a clear example of how reliability challenges are evolving in the age of AI.

Three overlapping issues

None of these bugs involved model weights or data. They lived in the infrastructure stack (routing, serving, and hardware), which made them especially hard to detect.

Why these incidents are hard to catch.

Claude is used by 19 million users each month, across three distinct models, each with several editions running in production. LLMs are complex, living systems that fail differently from more deterministic software. In the AI world, a 200 OK response does not necessarily mean everything is OK.

Additionally, Anthropic operates across multiple environments (AWS, GCP) and hardware platforms (TPU, GPU, Trainium).

In that setup, it’s entirely possible for one region or backend to degrade while others remain fine.

Some of the most common contributing factors we see when we begin discussions with customers facing similar challenges:

  • Heterogeneous hardware: different numerical precision and kernels can subtly alter model outputs (see the toy illustration below).
  • Multi-cloud deployments: metrics look normal even when only one cloud path is affected.
  • Limited visibility: privacy constraints and data separation slow down debugging and signal correlation.

By the time engineers can quantify the issue statistically, it’s often been in production for weeks.
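To see how the first factor can play out, here is a toy example (entirely unrelated to Anthropic’s stack) of precision alone flipping which token comes out on top.

```python
# Toy example (unrelated to Anthropic's stack): identical logits, cast to a
# lower precision, can flip which token comes out on top.
import numpy as np

logits_fp32 = np.array([10.0003, 10.0004], dtype=np.float32)
logits_fp16 = logits_fp32.astype(np.float16)

print(np.argmax(logits_fp32))  # 1 -- token 1 is (barely) more likely
print(np.argmax(logits_fp16))  # 0 -- both values round to 10.0; the tie goes to token 0
```

At the scale of billions of tokens, even rare flips like this add up to behavior users can feel but dashboards never show.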

Emerging patterns in modern reliability.

What Anthropic encountered mirrors what we hear every week during discussions with the broader engineering community, particularly those running AI inference, ML pipelines, or high-scale SaaS infrastructure.

1. Subtle degradations > hard outages

Availability metrics can stay green while model quality quietly declines.

Monitoring for semantic correctness, not just uptime, is becoming a new reliability requirement.
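What might that look like in practice? Below is a minimal sketch of a quality canary, assuming you maintain a small golden set of prompts with known-good answers and some generate callable for the production endpoint; both are hypothetical stand-ins, not any particular vendor’s API.

```python
# Minimal quality-canary sketch. The golden set, grade(), and the generate
# callable (a call to your production endpoint) are hypothetical stand-ins.
from statistics import mean

GOLDEN_SET = [
    {"prompt": "What is 17 * 24?", "expected": "408"},
    # ...more prompts with known-good answers
]

def grade(response: str, expected: str) -> float:
    # Simplest possible grader: substring match. Real graders might use
    # regexes, unit tests on generated code, or an LLM judge.
    return 1.0 if expected in response else 0.0

def quality_ok(generate, baseline: float, tolerance: float = 0.05) -> bool:
    score = mean(grade(generate(item["prompt"]), item["expected"])
                 for item in GOLDEN_SET)
    # Alert when quality drops meaningfully below the rolling baseline,
    # even while every availability and latency check stays green.
    return score >= baseline - tolerance
```

The point is not the grader itself but the alerting surface: quality gets an SLO and a pager, just like availability.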

2. Multi-team coordination defines resolution time

These incidents cut across model-serving, compiler, and infra teams. The faster organizations can establish ownership, the faster they recover.

We consistently see across Rootly customers that cross-functional clarity is the strongest predictor of MTTR.

3. Infrastructure configuration affects correctness

Small changes, like routing rules, compiler flags, or load-balancer updates, can meaningfully change model behavior.

Reliability engineering now needs to include model-output validation as part of deployment testing.
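One lightweight form of that validation, sketched below under the assumption that you can run the same prompts through both the current and the candidate serving stack with greedy (temperature-0) decoding, is an output-agreement gate. The generate_current and generate_candidate callables are hypothetical.

```python
# Hedged sketch of a rollout gate for infrastructure changes (routing rules,
# compiler flags, kernels). generate_current and generate_candidate are
# hypothetical callables that run greedy (temperature-0) decoding on the
# current and candidate serving stacks.
def output_agreement_gate(prompts, generate_current, generate_candidate,
                          min_agreement: float = 0.98) -> bool:
    matches = sum(generate_current(p) == generate_candidate(p) for p in prompts)
    agreement = matches / len(prompts)
    # With greedy decoding the two stacks should agree almost everywhere;
    # a noticeable drop in agreement is a reason to hold the rollout.
    return agreement >= min_agreement
```

The exact threshold is a judgment call, but the principle follows from the incident: infrastructure changes deserve output-level scrutiny, not just health checks.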

4. Benchmarks don’t reflect reality

Internal regression tests rarely capture live-traffic drift. Real-world feedback (user signals, quality scoring, anomaly detection) fills that gap.
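As a rough illustration of the kind of real-world signal that can fill that gap, a rolling z-score over a daily feedback metric (say, thumbs-down or regeneration rate) can flag live-traffic drift that a fixed benchmark misses. The window size and threshold below are illustrative assumptions.

```python
# Rough drift detector over a daily user-feedback signal (e.g., thumbs-down
# rate). The 28-day window and 3-sigma threshold are illustrative assumptions.
from collections import deque
from statistics import mean, pstdev

def feedback_drift(history: deque, today: float, threshold: float = 3.0) -> bool:
    if history.maxlen is None or len(history) < history.maxlen:
        return False  # not enough baseline yet
    mu, sigma = mean(history), pstdev(history)
    return sigma > 0 and (today - mu) / sigma > threshold

window = deque(maxlen=28)  # roughly four weeks of daily values
```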

5. Trade-offs must be explicit

Anthropic rolled back a performance optimization because it reduced quality. Mature teams are comfortable trading speed for correctness when the user experience depends on it.

What we’re seeing across the industry.

Looking at patterns across Rootly’s customer base, a few themes are becoming universal for reliability teams:

  • Incidents are increasingly multi-dimensional. They involve quality, correctness, and system health all at once.
  • Correlating related signals early saves hours. What looks like separate alerts often shares a common root cause.
  • Automation reduces friction. Teams that automate ownership, escalation, and context gathering recover faster and communicate better.
  • Retrospectives drive learning. Structured, repeatable retrospectives help organizations evolve from reactive to preventative reliability.
  • Transparency scales trust. Publishing honest retrospectives, as Anthropic did, improves reliability culture industry-wide.

Closing thoughts

Anthropic’s retrospective is more than an isolated story; it’s a snapshot of how reliability challenges are changing. Failures are no longer just about downtime; they’re about subtle degradation, cross-team complexity, and rapid learning.

Across Rootly’s customer base, we see the same shift. Reliability is expanding beyond incident response to include data correctness, model behavior, and cross-system visibility.

As AI infrastructure becomes more complex, the organizations that succeed will be those that treat reliability not as reactive chaos but as an ongoing process of observation, correlation, and learning. That is exactly the mindset reflected in Anthropic’s transparency.