What SREs Can Learn from Capt. Sully: When to Follow Playbooks
Does it always make sense to stick to your playbooks? There’s no clear answer, but it’s still something you should think about.
January 8, 2025
7 mins
Google SREs are redefining reliability practices with STAMP, addressing the limitations of traditional models as systems scale. Their approach highlights the need for system-wide hazard analysis.
This article is based on the paper released by Tim Falzone and Ben Treynor Sloss, both Google SREs.
It’s been over 20 years since Google outlined what SRE, as we know it today, means. SLOs and error budgets are now widely known concepts, applied at thousands of organizations to manage the reliability of their systems.
However, Google's complexity has grown enormously over the past two decades. In 2004, fresh off its IPO, Google had yearly revenue of $3.2 billion. In 2023, the firm reported over $307 billion in revenue, a roughly 100-fold increase. This revenue is backed by an ever-evolving portfolio of products and offerings, all supported by increasingly sophisticated systems.
While traditional SRE thinking remains valuable and in use at Google, the teams have continuously pushed its boundaries—eventually hitting a limit. Now, Google’s SRE team is adopting a new approach to reliability through STAMP, a framework based on control theory that introduces a fundamental shift in how incidents are approached.
SLOs and error budgets are foundational pillars of the SRE framework in most organizations. While they remain effective for many, Google SREs have encountered limitations when applying them to highly complex and large-scale systems.
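As a reminder of the arithmetic behind an error budget, here is a minimal Python sketch. The function names and the sample figures (a 99.9% target over one million requests) are illustrative assumptions, not values from the Google paper.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Requests allowed to fail in the window without breaching the SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves nothing to spend: any failure is a breach.
        raise ValueError("a 100% target leaves no error budget")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% availability SLO, 1,000,000 requests.
    print(error_budget(0.999, 1_000_000))           # allowed failures
    print(budget_remaining(0.999, 1_000_000, 250))  # share of budget left
```

Note that the calculation only works when some failure is acceptable. A target of 100% leaves a budget of zero, which is why the model struggles with properties that cannot tolerate any error.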
Some aspects of critical systems—such as data integrity, privacy, and regulatory compliance—cannot tolerate errors. In these cases, the goal isn’t low-frequency incidents and rapid mitigation, but absolute prevention.
SLOs work well with systems based on stateless web services. As system complexity increases, you start managing sophisticated state and complex dynamics between components.
A lot of incidents stem from the interactions between components that are each working perfectly fine according to their own standards.
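To make this concrete, here is a small hypothetical sketch in Python. Neither the components nor the thresholds come from the paper: a caller that never waits more than 2 seconds and a callee whose latency SLO is 3 seconds are both assumptions chosen for illustration. Each component satisfies its own standard, yet the pair fails together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caller:
    timeout_s: float
    max_allowed_timeout_s: float = 2.0  # assumed local standard

    def meets_own_spec(self) -> bool:
        # The caller's rule: never block a user longer than the allowed timeout.
        return self.timeout_s <= self.max_allowed_timeout_s


@dataclass(frozen=True)
class Callee:
    p99_latency_s: float
    latency_slo_s: float = 3.0  # assumed local standard

    def meets_own_spec(self) -> bool:
        # The callee's rule: p99 latency stays within its SLO.
        return self.p99_latency_s <= self.latency_slo_s


def interaction_is_safe(caller: Caller, callee: Callee) -> bool:
    """The pair only works if slow responses still arrive before the timeout."""
    return callee.p99_latency_s < caller.timeout_s


if __name__ == "__main__":
    caller = Caller(timeout_s=2.0)
    callee = Callee(p99_latency_s=2.5)
    print(caller.meets_own_spec(), callee.meets_own_spec())  # both compliant
    print(interaction_is_safe(caller, callee))               # the pair is not
```

No per-component dashboard would flag this situation, because the problem lives in the relationship between the two specifications rather than in either one.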
Architecture models are one of the cornerstones of SRE at Google. A model that explains the data flow between components is essential to understand potential risks and follow an incident’s logic. However, the way models are commonly built has limitations that complicate SRE practices at scale.
RPC diagrams are the gold standard used to represent systems. While they show the relationships between components, they don't reveal all their possible interactions. Tim Falzone and Ben Treynor Sloss outline a list of questions that highlight the usual gaps in a system model.
It would be impractical to annotate a traditional diagram with all these possibilities, as it would be difficult to capture and read.
As your system gets more complex, navigating its model becomes harder. When you have hundreds of components, it becomes overwhelming to even figure out where to start.
Given the sheer complexity of a system like Google’s, it is quite difficult to maintain a complete and up-to-date version of the model at any given time.
Rather than solely putting out the fires of the day, SREs strive to predict and prevent future failures. However, the introduction of AI and ML into systems makes their behavior harder to predict.
The moment an incident begins is open to interpretation. Does it begin when it's detected, when it impacts a customer, or when the bug was first introduced, even if unnoticed? Or does it begin when you implemented a CI/CD pipeline that could let the bug through?
This fuzziness and infinite regress make it harder to prioritize tasks and distribute ownership for proactive reliability.
Incidents are often studied as cause-and-effect phenomena—A happened because B happened. But system dynamics are more complex. Each system component has multiple dependencies, and interactions can be influenced by environmental factors.
STAMP (System-Theoretic Accident Model and Processes) is a theoretical framework developed by Nancy Leveson at MIT in the 2000s that applies control theory to system safety.
The core premise of STAMP is that safety can only be understood as a system-wide property, not the property of an individual component. In this framework, ‘accidents’ result from complex interactions between components—not just a linear chain of events.
STAMP also considers more than machine-to-machine interactions. It includes human action and external disturbances.
The shift introduced by STAMP is from “Did A cause B?” to “Which interactions in the system were inadequately controlled for A to happen?” Answering the latter question requires control over your system, meaning you have:
While SLOs help manage risk at the component level, the traditional view ultimately places your system in one of two states: normal or loss.
STAMP introduces an additional state: Hazard. A hazard isn’t a single event but a system condition that takes into account worst-case scenarios.
A major disadvantage of traditional SRE approaches is the abrupt switch from OK to Problem. With STAMP, you gain better insight into the system’s condition, allowing you to detect potential hazards in advance.
A system can stay in a hazardous state long before an incident happens—like when a bug is present but untriggered or a server is under-provisioned ahead of a traffic surge.
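The three-state view can be sketched in a few lines of Python. This is a minimal illustration of the under-provisioning example only: the function name, the QPS figures, and the idea of comparing a worst-case forecast against capacity are assumptions made here, not a method described in the paper.

```python
from enum import Enum


class SystemState(Enum):
    NORMAL = "normal"
    HAZARD = "hazard"
    LOSS = "loss"


def classify(current_qps: float, capacity_qps: float,
             worst_case_qps: float) -> SystemState:
    """Classify a service by current load and by its worst-case forecast."""
    if current_qps > capacity_qps:
        # Already overloaded: users are being affected right now.
        return SystemState.LOSS
    if worst_case_qps > capacity_qps:
        # Fine today, but a plausible surge would exceed capacity.
        return SystemState.HAZARD
    return SystemState.NORMAL


if __name__ == "__main__":
    print(classify(400, 1000, 800))    # headroom even in the worst case
    print(classify(400, 1000, 1500))   # healthy now, exposed to a surge
    print(classify(1200, 1000, 1500))  # already over capacity
```

A binary OK/Problem check would report the second case as healthy. The hazard state exists to surface exactly that window, while there is still time to add capacity.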
STAMP serves as the foundation for STPA (System-Theoretic Process Analysis), a hazard assessment methodology used in aviation, manufacturing, and other industries.
Google SREs are applying the STPA Handbook to analyze system interactions and identify ineffective controls at the system level.
By applying STPA to their most complex systems, Google SREs have uncovered hazardous scenarios that could lead to outages. This knowledge enabled them to mitigate issues with quick fixes and long-term engineering efforts.
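One step of STPA is to walk through each control action in the system and ask how it could be unsafe. The STPA Handbook groups these into four categories: the action is not provided, it is provided when it causes a hazard, it is provided with the wrong timing or order, or it is stopped too soon or applied too long. The Python sketch below enumerates candidate questions for a hypothetical control structure. The controllers and actions are invented for illustration and do not describe Google's systems.

```python
from dataclasses import dataclass
from itertools import product

# The four ways a control action can be unsafe, paraphrased from the STPA Handbook.
UCA_TYPES = (
    "not provided when needed",
    "provided when it causes a hazard",
    "provided too early, too late, or out of order",
    "stopped too soon or applied too long",
)


@dataclass(frozen=True)
class ControlAction:
    controller: str
    action: str
    controlled_process: str


def candidate_ucas(actions: list[ControlAction]) -> list[str]:
    """Pair every control action with every unsafe-control-action category."""
    return [
        f"{a.controller} -> {a.controlled_process}: '{a.action}' {uca}"
        for a, uca in product(actions, UCA_TYPES)
    ]


if __name__ == "__main__":
    # Hypothetical control structure: an autoscaler and a human operator.
    actions = [
        ControlAction("autoscaler", "add replicas", "serving fleet"),
        ControlAction("on-call engineer", "roll back release", "serving fleet"),
    ]
    for line in candidate_ucas(actions):
        print(line)
```

The output is only a checklist of prompts. Deciding which combinations are actually hazardous in context, and why a controller might issue them, remains analyst work.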
Google’s work with STAMP is impressive from both theoretical and practical perspectives. However, Google operates at a scale most companies won’t experience.
Even though each STPA run is described as requiring “little effort,” it still typically takes several weeks of engineering work.
If your SLOs and models are reaching their limits, you might be approaching the complexity that warrants non-linear approaches to hazard prevention. Frameworks like STPA and CAST (Causal Analysis based on Systems Theory) offer ways to operationalize these concepts.