Cixin Liu, the famous science fiction author, is remarkably good at conveying the scale of the universe, which is usually hard to imagine. For example, the sun is about 149 million kilometers (or 93 million miles) away from the earth, but that doesn’t tell me much other than yeah, it’s kind of far. When Liu explains that if the sun were the size of a soccer ball, the earth would be a speck of dust floating meters away from it, then I get a much better sense of the scale we’re dealing with.
When I worked at Shopify, we processed around 675 billion transactions yearly. If each meter separating the earth from the sun were a transaction, you could travel to the sun nearly five times at Shopify’s processing scale. It’s a silly analogy, but it helps put things into perspective.
Enterprise scale is not abstract; it’s a reality that’s hard to imagine unless you experience it first-hand. Enterprises have unique challenges that only appear at that scale. You won’t see their problems at a startup or a mid-sized company; worrying about them is pointless until you have to deal with them.
When your company is larger than entire nations, what does incident management look like? When you’re a startup, you’ll get a headache if your service is down for a few hours. You’d be six feet under within a week if you took that approach at an enterprise: a sunny day can see dozens of ongoing incidents across the organization.
In this article, you’ll get an overview of the kind of challenges to expect when dealing with incidents at scale, how to identify gaps in your reliability strategy, and how to measure the success of your incident management.
These insights are based on my experience working with enterprise partners at Rootly, such as LinkedIn, NVIDIA, and Cisco. I’m also drawing best practices from my conversations with reliability leaders from Google, Microsoft, and Okta on my Humans of Reliability series.
Enterprise Incident Management Challenges
Reliability at Scale
When you have a single service, you could call your company reliable if that service is available, say, 99.9% of the time. But what does it mean to be reliable when you have thousands of services supporting different business lines in different regions?
- Defining reliability: Defining reliability standards through Service Level Objectives (SLOs) is an exercise enterprises invest in because it’s typically linked to direct financial impact. Salvatore Furino, Reliability Engineer at Bloomberg, explains that part of his day-to-day job is talking with teams to refine and monitor existing SLOs. Enterprises typically review their SLOs at least yearly, assessing, swapping, or retiring them as needed.
- Mixed maturity across services: Services are developed, maintained, and run by different teams—usually a plethora of them. It’s expected that some will perform better than others regarding reliability targets. That doesn’t mean some teams are better than others. System availability can be a result of the service's complexity, demand, tech stack, infrastructure, and many other parameters.
- Global distribution: Let’s say you set a global error budget of 1% for Service A, which is used by customers across 100 locations. If Service A is available 100% of the time in 99 locations but has serious issues in one location, your criteria would not raise an alert even though a segment of your users is experiencing critical disturbances (see the sketch after this list). At enterprise scale, you’ll need to add geographic location as an extra dimension to every reliability question.
- Working with reliability partners at scale: All the partners you work with to enable observability, alerting, on-call scheduling, and incident management must be able to keep up with your reliability standards at your scale. For example, Rootly is the only multi-cloud alerting solution for enterprises running on both AWS and GCP and offering a 99.99% SLA.
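To make the global-distribution point concrete, here’s a minimal sketch of a per-region availability check. Everything in it is hypothetical (region names, request counts, the 1% error budget from the example above); real SLO tracking would live in your observability stack, but the idea is the same: evaluate the budget per location, not just globally.

```python
from dataclasses import dataclass

SLO_TARGET = 0.99  # the 1% global error budget from the example above


@dataclass
class RegionStats:
    region: str
    total_requests: int
    failed_requests: int

    @property
    def availability(self) -> float:
        return 1 - self.failed_requests / self.total_requests


def check_availability(regions: list[RegionStats]) -> None:
    # Global availability: aggregate every request regardless of where it came from.
    total = sum(r.total_requests for r in regions)
    failed = sum(r.failed_requests for r in regions)
    print(f"Global availability: {1 - failed / total:.2%}")

    # Per-region availability: the same target applied to each location separately.
    for r in regions:
        if r.availability < SLO_TARGET:
            print(f"  SLO breach in {r.region}: {r.availability:.2%}")


# 99 healthy locations plus one location where 30% of requests are failing.
regions = [RegionStats(f"region-{i:02d}", 1_000_000, 0) for i in range(99)]
regions.append(RegionStats("region-99", 1_000_000, 300_000))

check_availability(regions)
```

With 100 locations, a 30% failure rate in one of them barely dents the global number (99.7%, comfortably inside the 1% budget), which is exactly why location has to be a first-class dimension in enterprise SLOs.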
{{subscribe-form}}
Effective Communication During Incidents
Does the CTO need to know about every incident? Probably not. But how do you determine who actually needs to know when hundreds of colleagues from different teams and functions are loosely connected to any given incident you run into?
- Escalation policies at scale: Escalation policies at enterprises look like fractals. If you zoomed in on the global policy level by level, you’d see that each level is composed of dedicated teams, each team has its own escalation policy, likely with subteams at each of its levels, and so forth (a toy sketch of this nesting follows this list). Enterprises set criteria for when and how incidents should be escalated at each level, and the escalation might be vertical, toward more senior roles, or horizontal, toward different functions.
- Bringing in the right people: One of the most challenging parts of resolving an incident is figuring out who the right people to handle it are. At scale, you may get an alert on your service, but the issue may spread well beyond the parts of the system you know. Dozens of people commit hundreds of diffs daily to whatever could potentially be causing the incident. Teams at Meta are using AI to narrow the search for a root cause, while incident management platforms like Rootly offer AI to find teammates who have solved similar incidents in the past.
- Collaborating with non-engineering teams: Incidents can have a direct impact on the business, which means responders need guidelines on when to contact other departments. For example, if an issue is impacting key customers, their account managers should be informed as soon as possible so they can handle the situation on their end while the incident is mitigated. Or perhaps support needs a heads-up because a flood of requests and questions may be coming their way due to a problem in the returns system.
- Dealing with external communications: It is unlikely that a responder can just go ahead and update the status page when an incident breaks. Enterprises make sure to safeguard how they communicate incidents to the outside world. Most messages about an incident will have to go through Legal and Public Relations.
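To illustrate the fractal structure described above, here’s a minimal sketch with entirely hypothetical team and responder names. It models an escalation policy whose levels can themselves be nested policies, and walks the levels until someone acknowledges the page; real policies live in your paging tool, but the nested shape is the same.

```python
from dataclasses import dataclass, field


@dataclass
class EscalationPolicy:
    name: str
    # Each level is either a list of responders to page or a nested policy of its own.
    levels: list["list[str] | EscalationPolicy"] = field(default_factory=list)

    def escalate(self, acknowledged: set[str]) -> str | None:
        """Walk the levels in order and return the first responder who acknowledges."""
        for level in self.levels:
            if isinstance(level, EscalationPolicy):
                responder = level.escalate(acknowledged)  # recurse into the sub-policy
            else:
                responder = next((p for p in level if p in acknowledged), None)
            if responder:
                return responder
        return None  # nobody acknowledged; the parent policy moves to its next level


# Hypothetical global policy: the payments team's own policy sits at level one.
payments = EscalationPolicy(
    "payments",
    levels=[["payments-primary"], ["payments-secondary"], ["payments-lead"]],
)
global_policy = EscalationPolicy(
    "global",
    levels=[payments, ["infra-on-call"], ["engineering-director"]],
)

print(global_policy.escalate(acknowledged={"payments-secondary"}))  # payments-secondary
```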
Managing On-call For Large Teams
You need people on-call for hundreds of services around the world, which translates to dozens of rotations scheduled to cover every service across every time zone.
- Delegated on-call planning: It would be impossible for one person to plan the on-call rotations for everything in an organization. Instead, teams organize themselves to keep 24/7 coverage over their own scopes, using whichever on-call scheduling strategy suits them. However, not just anyone should be able to modify someone else’s schedule. That’s why enterprise on-call solutions like Rootly offer detailed RBAC support, which can even be controlled via Terraform.
- Service-based rotations: The most common approach is to organize on-call coverage around specific services. You want to make sure somebody is available to fix any issue in the payments service as soon as it happens, so you set up primary and secondary rotations to cover every eventuality (a toy rotation calculation is sketched after this list).
- Team-based rotations: However, enterprises often want to page teams independently of a service. It may be the team responsible for a module using a new tech stack that nobody else is familiar with, or a team of database engineers because they just migrated to a new schema. Legacy on-call solutions like PagerDuty do not support this natively, which leads to a lot of manual work by administrators. Most modern on-call alerting solutions do support team-based rotations though.
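Here’s a toy sketch of the kind of arithmetic behind a follow-the-sun rotation, with hypothetical engineers, an arbitrary epoch, and eight-hour shifts. Real schedules come from your on-call tool and handle overrides, holidays, and fairness, but underneath it all is the same deterministic mapping from a timestamp to a shift, and from a shift to a person.

```python
from datetime import datetime, timedelta, timezone

SHIFT_HOURS = 8  # each region covers an eight-hour window per day
ROTATION_EPOCH = datetime(2024, 1, 1, tzinfo=timezone.utc)

# Hypothetical follow-the-sun rotation: three regional sub-teams share the day,
# and within each region engineers swap weekly.
SHIFTS = [
    ("APAC", ["asha", "kenji"]),
    ("EMEA", ["lena", "tomas"]),
    ("AMER", ["maria", "jordan"]),
]


def who_is_on_call(at: datetime) -> tuple[str, str]:
    """Return (region, engineer) covering this rotation at a given instant."""
    hours_elapsed = int((at - ROTATION_EPOCH) / timedelta(hours=1))
    shift_index = (hours_elapsed // SHIFT_HOURS) % len(SHIFTS)  # which region is up now
    week_index = hours_elapsed // (24 * 7)                      # weekly hand-off inside a region
    region, engineers = SHIFTS[shift_index]
    return region, engineers[week_index % len(engineers)]


print(who_is_on_call(datetime.now(timezone.utc)))
```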
Navigating Security Challenges
Few things are as critical as security for enterprise teams. Everything is on the line: data breaches, stock price collapses, class actions. That’s why businesses at scale go all out to ensure their processes, and their vendors, are secure and compliant.
- Dedicated security-related workflows: Most enterprise teams have dedicated processes and protocols when they detect a security incident. Incident response tools like Rootly let enterprise teams define specific workflows for security incidents, which include tailoring an incident declaration form.
- Principle of least privilege: Alerting and incident management should be subject to enterprise security practices, like the principle of least privilege. With Rootly, you have granular control over read/write access to each aspect of on-call and incident response (a toy deny-by-default model is sketched after this list).
- Working with secure vendors: SOC 2 compliance is a baseline requirement for any vendor in 2024, but on its own it’s not enough at enterprise scale. Rootly is designed for security teams and won “Overall Security Orchestration, Automation and Response (SOAR) Solution Provider of the Year" in 2023 at the CyberSecurity Breakthrough Awards, on top of a handful of industry certifications.
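As a concrete illustration of least privilege, here’s a minimal, entirely hypothetical sketch of a deny-by-default role model for incident tooling: a role only gets an action on a resource if it is explicitly granted. In practice this is configured in your platform’s RBAC settings rather than written by hand.

```python
# Hypothetical roles: each grants explicit actions per resource type, nothing more.
PERMISSIONS = {
    "responder": {"incidents": {"read", "write"}, "schedules": {"read"}},
    "scheduler": {"incidents": {"read"}, "schedules": {"read", "write"}},
    "observer":  {"incidents": {"read"}, "schedules": {"read"}},
}


def can(role: str, action: str, resource: str) -> bool:
    """Deny by default: allow only what the role explicitly grants on the resource."""
    return action in PERMISSIONS.get(role, {}).get(resource, set())


assert can("responder", "write", "incidents")
assert not can("responder", "write", "schedules")  # least privilege in action
assert not can("auditor", "read", "incidents")     # unknown roles get nothing
```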
Automations
Implementing automations at an enterprise company is not just about improving efficiency anymore. Automations around incident management are a necessity to speed up mitigation times and prevent compliance issues.
- Automations require dedicated implementations: The number of systems, vendors, and processes used by enterprises means automations require real implementation work. You’ll have SREs dedicating substantial effort to building the integrations that connect the dots between your various tools and processes (a sketch of this kind of glue code follows this list).
- Seek vendors who act as partners: No matter how “easy” the automations offered by a vendor are, they’ll rarely work out of the box for the sophisticated use cases of an enterprise. Prioritize on-call and alerting solutions, like Rootly, that are ready to partner with your team to develop a solution that works for you.
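To give a feel for the implementation work involved, here’s a sketch of the kind of glue code SREs end up writing to connect tools: a hypothetical inbound alert payload gets filtered by severity and forwarded to a Slack incoming webhook (the URL is a placeholder). A purpose-built incident platform replaces this one-off code with configurable workflows.

```python
import json
import urllib.request

# Placeholder: a real Slack incoming-webhook URL would go here.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"


def handle_alert(alert: dict) -> None:
    """Turn a high-severity alert (hypothetical payload shape) into a chat notification."""
    # Keep lower-severity noise out of the shared channel.
    if alert.get("severity") not in {"sev1", "sev2"}:
        return

    message = {
        "text": f":rotating_light: {alert['severity'].upper()} on {alert['service']}: "
                f"{alert['summary']}"
    }
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)  # fire-and-forget notification


handle_alert({
    "severity": "sev1",
    "service": "payments-api",
    "summary": "Error rate above 5% in eu-west-1",
})
```

Multiply this by every pair of tools, every team’s conventions, and every compliance requirement, and it becomes clear why a vendor willing to co-develop these automations with you saves real engineering time.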
{{cta-demo}}
Assessing Your Current Incident Management Strategy
You don’t arrive at an enterprise scale without some sort of incident management solution. Whether it’s an ad-hoc collection of utilities that were put together as the need arose over time or a legacy vendor you inherited from the previous management, your teams are somehow managing and resolving incidents.
However, just as your team keeps shipping new products, paying off tech debt, and renovating their infrastructure, your alerting and incident management solution needs to keep evolving to meet new demands and objectives.
Evaluating Existing Tools and Processes
Before you decide something is broken or works well, you’ll need to get a good understanding of how incidents are managed at the moment. This is not a simple task. It’ll require you to go into the trenches to see how people on-call are responding to incidents.
- Map the actual processes being used: Incident management is a complex process involving a variety of parties. You want to capture what is happening right now, not how you envision it or how it was designed. There are several frameworks for mapping enterprise processes; you may not need to produce a full report unless you’re trying to secure budget from stakeholders, but their guiding principles are still useful. If your organization already uses a process-mapping framework, stick to it, or at least borrow its ideas.
- Make a vendor inventory: You’ll be surprised by the number of vendors you have contracts with that are loosely connected to incident management. Write them down, and try to get a realistic picture of how much they’re used and what they cost. It is common to find unused seats or cases where a different tier would work better.
- Divide and conquer: Trying to get a global view of how incidents are managed across the organization is ambitious. Start by focusing on a few teams that are particularly relevant to you and scale from there.
Identifying Gaps and Areas of Improvement
After mapping your existing incident management process, some gaps will become apparent. The next step is about prioritizing and planning how to actually improve your process.
- Talk to your responders: Talking with the people involved about what bothers them most can clarify where you can start adding value to the incident resolution process. Often, big pain points can be improved with relatively simple changes. In other cases, you won’t be so lucky and may need major overhauls.
- Plan with adoption in mind: You cannot change incident management at an enterprise level from one day to the next. You’ll need buy-in from different teams, and you’ll likely have to support parallel solutions while you deprecate legacy software. Every change needs an intentional adoption path, or you risk ending up with a new setup that nobody uses.
Enterprise Incident Management Best Practices
Implement a Centralized Enterprise Incident Management Platform
Tool sprawl can hurt your reliability by forcing responders to juggle a variety of disconnected tools and context switches to address an incident. Opt for a centralized platform that integrates alerting, on-call schedules, incident management, and retrospectives in one place, such as Rootly. This saves you from investing in custom integrations and consolidates your practice.
Ensure Effective Communication During Incidents
Incidents can be hectic, with many people interested in how they evolve. Your incident management software can do a lot to keep communications streamlined. First of all, solutions like Rootly bring clarity by becoming the single source of truth for any incident-related insight. Plus, you can offload communication tasks to automatic workflows that, for example, notify certain Slack channels when a SEV1 incident is declared in a specific service.
Automate Incident Response
Enterprise teams invest heavily in automation. The amount of manual work involved in incidents is surprising, especially after they’ve been resolved: retrospectives can take a long time as you gather the facts after the fact. Incident management platforms like Rootly automatically keep track of what happens throughout your incident resolution process and can construct a timeline, suggest likely causes with GenAI, and file action items in Jira.
Leverage AI
AI can significantly enhance incident management by providing data-driven insights, predictive analytics, and automated decision support. When leveraging AI, it’s crucial to ensure that it complements human judgment rather than replacing it. AI should be used to process vast amounts of data quickly, identify patterns, and suggest possible actions, leaving the final decision-making to experienced incident responders who can interpret the context and nuances of each situation.
Rootly’s AI capabilities provide advanced insights and suggestions, helping incident responders resolve issues faster. With a privacy-first approach and human-in-the-loop design, Rootly’s AI supports decision-making without replacing human judgment, ensuring that your incident response is both efficient and effective.
Regularly Review and Update Processes
Incident management needs to evolve along with its context. Set a yearly cadence to review the entire process and toolset. The review should ideally include input from responders and the metrics from the period under review.
Measuring the Success of Your Incident Management Strategy
Key Metrics to Track + Rootly’s Benchmark Data
Deciding which SRE metrics are useful to you and which ones aren’t is the first step to evaluating the performance of your reliability strategy. Even then, what constitutes “good” or “needs improvement” will entirely depend on your context and SLOs.
To provide some orientation, the team at Rootly processed the anonymized data of about 150,000 high-severity incidents across enterprise-tier customers, excluding accounts with more than 5,000 employees because they distorted the dataset disproportionately.
MTTR Benchmark
The Mean Time to Resolution/Recovery (MTTR) in our dataset, measured from detection to recovery, is distributed as follows (a toy sketch of how such a distribution can be computed appears after the list):
- About 8% of these incidents were mitigated in less than 30 minutes.
- About 22% were mitigated between 30 minutes and 1 hour.
- About 15% were mitigated between 1 and 2 hours.
- The remaining incidents (55%) took more than 2 hours to mitigate.
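For reference, here’s a toy sketch, using made-up incident records, of how a time-to-recovery distribution like the one above can be computed: measure detection-to-recovery per incident, bucket it, and report each bucket’s share.

```python
from datetime import datetime, timedelta

# Made-up incident records; real ones would come from your incident platform's export.
incidents = [
    {"detected": datetime(2024, 5, 1, 10, 0), "recovered": datetime(2024, 5, 1, 10, 20)},
    {"detected": datetime(2024, 5, 2, 9, 0), "recovered": datetime(2024, 5, 2, 9, 50)},
    {"detected": datetime(2024, 5, 3, 14, 0), "recovered": datetime(2024, 5, 3, 17, 30)},
]

# Upper bounds for each bucket, mirroring the benchmark above.
BUCKETS = [
    ("< 30 min", timedelta(minutes=30)),
    ("30 min - 1 h", timedelta(hours=1)),
    ("1 - 2 h", timedelta(hours=2)),
    ("> 2 h", timedelta.max),
]


def mttr_distribution(incidents: list[dict]) -> dict[str, float]:
    """Bucket each incident's time to recovery and return the share per bucket."""
    counts = {label: 0 for label, _ in BUCKETS}
    for incident in incidents:
        time_to_recovery = incident["recovered"] - incident["detected"]
        for label, upper_bound in BUCKETS:
            if time_to_recovery < upper_bound:
                counts[label] += 1
                break
    return {label: count / len(incidents) for label, count in counts.items()}


print(mttr_distribution(incidents))  # shares per bucket, summing to 1.0
```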
Follow-up Actions
Using the same dataset, we measured how long it took incidents to reach the following state: all follow-up action items completed and a retrospective published.
- 8% completed in less than 1 week.
- 28% completed between 1 and 2 weeks.
- 23% completed between 2 weeks and 1 month.
- 16% completed in more than 1 month.
- The remaining incidents (25%) had missing or incomplete data, implying their follow-ups were either not done or still in progress.
Tailoring Metrics to Your Needs
Rootly allows organizations to build custom dashboards. This is vital, as not every team inside a large enterprise needs or cares to see metrics from other teams. Rootly customers often have dashboards that specific teams rely on (e.g., the mobile app team views their most common incident cause drivers). Customization is powerful yet easy to use.
Leveraging Technology in Enterprise Incident Management
Choosing the Right Incident Management Software for Your Enterprise
The incident management software you choose to partner with to develop your reliability strategy can make a big difference in your implementation. Ease of use paired with the right amount of configurability, cost-effectiveness, and the ability to integrate with your existing tools play a critical role when choosing an incident management solution.
Rootly caters to global enterprises such as LinkedIn, NVIDIA, Figma, Elastic, Cisco, and Shell. We specialize in large-scale enterprise deployments, such as LinkedIn’s with 10,000 users, and the majority of our customers are large global enterprises with 5,000+ users on Rootly.
Conclusion
Building an incident management strategy at scale requires you to continuously assess how your responders are working and how effectively you’re hitting your SLOs.
Check out our full-cycle Incident Response Guide to get more insights into each step of the process.
{{cta-incident}}