

July 10, 2025
11 mins
How should you structure your incident response team? From severity-based escalation to role-driven orchestration, hybrid models are helping teams scale reliability and balance resources.
I went to SREDay earlier this year to give a talk on incident response and rescue operations. I learned a bunch and connected with great people in the community. A talk that made an impression on me was “A story of 100s of 1000s of SLOs across the globe,” by Panos Moustafellos, Distinguished Engineer at Elastic (fun fact: Elastic uses Rootly to manage incidents). In his talk, Panos described how Elastic SREs manage 60 geo regions, 10s of thousands of customers, and 300,000+ SLOs with a team of only 58 engineers.
“What does it take to make that possible?” I asked Panos after the talk. It was not easy, he explained. We talked about culture and SRE team models as key enablers for scalable reliability.
That conversation stuck with me. It raised a broader question: how do different organizations structure their reliability efforts to scale? A few weeks later at Rootly, we hosted a Reliability Leaders Roundtable to dig into exactly that: centralized vs distributed incident response. We brought together SRE leaders and engineering executives from across the industry to share how their teams are structured, where they’ve struggled, and what’s worked as they’ve scaled.
In this blog post, I’m bringing together insights shared by industry leaders during the roundtable, along with notes from my own work and conversations with other engineers in the field. I’ll start by outlining what centralized and distributed incident response models look like, then dive into the different ways teams are blending the two into hybrid approaches.
A centralized incident response model is one where a dedicated team, often staffed by trained incident commanders, takes over coordination and communication responsibilities across the organization.
Centralized teams enforce consistent standards, manage stakeholder and customer communications, and ensure follow-up processes like postmortems are completed. One participant said, “The centralized model is when it really starts to be beneficial, especially as you get a bigger organization,” noting that it's hard to “decentralize those standards and practices across engineering” at scale.
Centralization also provides continuity amid org changes and supports executive-level visibility: “There’s a real need by higher executive leadership to have business data and it’s really hard to set a standard if every team is doing their own thing.” Ultimately, centralized models provide process depth, clarity, and accountability.
A distributed incident response model pushes ownership and execution of incident handling to individual service-owning teams.
In this model, “we empower the service-owning teams to handle that themselves, they run it themselves.” The rationale is that these teams know their service better. Distributed models support autonomy and scale by giving engineers control, which aligns with DevOps principles: where you feel the pain is where you focus your effort to fix things.
However, challenges include inconsistent documentation, siloed communication, and limited experience. As one speaker warned, “You can’t expect an engineer in a distributed model to remember the 64 steps the centralized team did, they might only run one outage a year.” Another noted that teams often “downplay severity” to avoid paperwork or scrutiny.
The hybrid approach combines the strengths of both centralized and distributed models, adapting based on incident severity, urgency, and organizational maturity.
Centralized and distributed incident response models each have their strengths, but neither is a perfect fit for every team. In practice, most organizations land somewhere in between: mixing elements from both to fit their size, maturity, and the types of incidents they handle.
That’s where hybrid models come in. These setups aren’t one-size-fits-all; they adapt to the needs of the team and the moment, aiming to balance speed, ownership, and coordination in the middle of chaos.
In a severity-based setup, lower-severity incidents (like SEV 3s and 4s) are owned and resolved by the service-owning teams, empowering them to act quickly and independently. For higher-severity incidents (SEV 1s and 2s), a centralized incident commander steps in to manage orchestration, stakeholder communication, and follow-up. As one participant explained, “We empower the service-owning teams to handle that themselves but when it goes to SEV 1 and SEV 2, we bring in a dedicated incident commander.”
Importantly, this model isn’t rigid. Escalation can be dynamic and automated as an incident progresses. “Sometimes you start off with a SEV 3 and all of a sudden it escalates pretty rapidly to a SEV 1,” one leader noted. Tooling helps smooth the transition: “when responders bump the incident up to a SEV 2 or SEV 1, Rootly automatically pages out an incident commander.” This responsive hybrid system ensures the right level of coordination is activated at the right time, enabling organizations to scale incident response while protecting autonomy where appropriate.
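To make that concrete, here’s a minimal sketch of what a severity-based escalation rule can look like. It assumes a hypothetical incident record and a placeholder page_incident_commander() helper rather than any particular tool’s API; the point is where the trigger lives, not the specific integration.

```python
# Sketch of severity-based escalation: once an incident crosses the SEV 2
# threshold, page a dedicated incident commander. All names here are
# illustrative placeholders, not a real incident-management API.
from dataclasses import dataclass

COMMANDER_SEVERITIES = {"SEV1", "SEV2"}  # severities that warrant a dedicated IC

@dataclass
class Incident:
    id: str
    severity: str                 # "SEV1" through "SEV4"
    commander_paged: bool = False

def page_incident_commander(incident: Incident) -> None:
    """Stand-in for the real paging integration (PagerDuty, Opsgenie, etc.)."""
    print(f"Paging on-call incident commander for {incident.id}")

def on_severity_change(incident: Incident, new_severity: str) -> None:
    incident.severity = new_severity
    # Page exactly once, at the moment the incident is bumped into SEV 1/2 territory.
    if new_severity in COMMANDER_SEVERITIES and not incident.commander_paged:
        page_incident_commander(incident)
        incident.commander_paged = True

# A SEV 3 that escalates rapidly, as in the example above:
inc = Incident(id="INC-1042", severity="SEV3")
on_severity_change(inc, "SEV2")  # commander is paged here; responders keep working
```

The design choice worth copying is that the escalation is attached to the severity-change event itself, so nobody in the middle of an outage has to remember to call for help.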
A role-based hybrid model draws a clear line between technical ownership and coordination responsibilities during an incident. Engineers act as incident responders, diagnosing and resolving issues within their systems, while a rotating group of dedicated incident commanders focuses on orchestration, communications, and post-incident processes.
As one participant explained, “We use a rotation of four incident commanders. They gather resources, drive the investigation, and lead the RCA.” This allows service teams to stay focused on restoration without being burdened by process or executive communication. Early roles are fluid: “Whoever’s the first person on is the incident commander, whether they know it or not, until they find someone to take it.”
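As a rough illustration of that default-and-handoff rule, here’s a small sketch in the same spirit as the snippet above. The rotation names and data model are made up for the example; the behavior shown is simply “the first responder holds the commander role until someone from the dedicated rotation takes it.”

```python
# Sketch of "whoever's first on is the incident commander, until relieved."
# Field names and the rotation itself are hypothetical, for illustration only.
from dataclasses import dataclass, field
from typing import List, Optional

IC_ROTATION = {"alice", "bob", "carol", "dave"}  # the dedicated commander rotation

@dataclass
class Incident:
    responders: List[str] = field(default_factory=list)
    commander: Optional[str] = None

def join_incident(incident: Incident, engineer: str) -> None:
    incident.responders.append(engineer)
    if incident.commander is None:
        # The first person on the incident is the commander by default.
        incident.commander = engineer
    elif engineer in IC_ROTATION and incident.commander not in IC_ROTATION:
        # A dedicated commander takes over coordination; the original
        # responder goes back to hands-on investigation.
        incident.commander = engineer

inc = Incident()
join_incident(inc, "eve")    # eve is acting commander, whether she knows it or not
join_incident(inc, "bob")    # bob, from the rotation, takes the role over
print(inc.commander)         # -> bob
```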
This model is particularly effective in organizations that want consistent coordination without removing service teams from the reliability loop. It also helps newer engineers gain confidence while ensuring experienced leaders handle stakeholder management and systemic accountability.
In another variation, technical resolution is distributed, but communication remains centralized. While engineering teams investigate and fix issues, a dedicated function, typically incident commanders or customer-facing teams, controls messaging to internal and external stakeholders.
One leader described it this way: “We centralize all the tracking and recording. It helps enforce the process and ensure resolution.” Communication to customers, especially, is handled with care: “We tie in our customer service team to make sure what we’re saying is accurate but doesn’t create a panic.”
This model recognizes that clear, consistent messaging, especially in high-pressure incidents, is a specialized skill. It ensures executive and customer-facing updates are delivered with the right tone and frequency, while allowing engineering teams to stay focused on mitigation. For companies where reputation and trust are business-critical, this hybrid is a natural fit.
A maturity-based model adapts to the readiness and reliability maturity of individual teams or business units. Less mature or less experienced teams operate under a more centralized approach to ensure process rigor and mentorship. More experienced teams, once they’ve demonstrated strong ownership and postmortem practices, are gradually given more autonomy.
“The less mature the organization, the more centralized you probably want to have things,” one participant explained. Another added, “We try to push it down to teams, but only when they’re ready, as some teams downplay severity to avoid the follow-up.”
This approach acknowledges that distributed responsibility isn’t a default; it’s earned. It gives leaders flexibility to build organizational trust over time, letting process maturity evolve alongside team growth. In this sense, it’s not just a model but a progression toward distributed reliability at scale.
There’s no silver bullet when it comes to structuring incident response. Even teams that lean heavily into centralized or distributed models still find themselves adapting on the fly. The hybrid approach isn’t just a compromise. Rather, it’s a system that evolves with your organization.
A few themes stood out. First, clarity matters, especially in fast-moving incidents. “You have to have someone on incidents who's making decisions, and once you get somebody who's making decisions, usually things start moving along a lot better.”
Second, empowerment must be intentional. One leader noted, “There’s a lot of engineers that don’t necessarily feel empowered to make big decisions, even when it comes to ‘do I fail back or fail forward?’” Centralized models help fill that gap; distributed ones require careful investment in culture, tooling, and trust.
The takeaway? Don’t think of your incident response model as a fixed template. Think of it as a reliability strategy: one that flexes with severity, scale, and human context. The best teams aren’t just reacting to incidents, they’re designing a system that helps everyone show up ready when it matters.