September 17, 2024

8 mins

5 Proven Strategies to Reduce MTTR

Prolonged downtime can have costly consequences for your organization. By reducing your Mean Time to Resolution (MTTR), you limit potential revenue loss and reputational damage. Learn the best practices used by top SRE teams, from communication and automation to tracking the right data.

Written by

Jorge Lainfiesta

Reliability is a long-term journey. eBay’s first notable incident occurred in 1999, when its platform was fully unavailable for 22 hours. The company lost $3.29 million (equivalent to $6.10 million in 2024), and its stock price tumbled. Since then, the company has come a long way. eBay’s latest significant incident was only partial and was resolved within an hour. Decades of substantial technology investments and continuous reliability improvements have made eBay a model in the industry, boasting 99.99% availability even during traffic spikes.

Reducing your Mean Time to Resolution (MTTR) has direct business consequences. Whether through lost revenue, compensation owed for SLA breaches, or damage to brand reputation, incidents can negatively impact any organization. That’s why teams invest in technology, staff, and processes that enable them to recover as quickly as possible after an incident occurs.

In this blog post, you’ll get an overview of what MTTR is and the factors that can affect its performance. You’ll also learn best practices and proven strategies that have worked for more than a hundred SRE teams.

What Is MTTR?

Mean Time to Resolution (MTTR) is a key reliability metric that focuses on the time it takes your team to resolve an incident. To calculate it:

Select a relevant period to evaluate (e.g., the last 30 days, last quarter, etc.).
For each incident in the selected period, measure the time that passes between when an incident is marked as acknowledged and when it is marked as resolved.
Calculate the mean time to resolution by summing all the individual resolution times and dividing by the number of incidents.
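
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the output of any particular incident management tool: the function name, the dictionary keys (acknowledged_at, resolved_at), and the sample incidents are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents, period_start, period_end):
    """Mean acknowledged-to-resolved time for incidents acknowledged in the period.

    `incidents` is a list of dicts with hypothetical keys
    "acknowledged_at" and "resolved_at" (both datetime objects).
    Returns a timedelta, or None when no resolved incident falls in the period.
    """
    durations = [
        incident["resolved_at"] - incident["acknowledged_at"]
        for incident in incidents
        # Step 1: keep only incidents inside the selected period.
        if period_start <= incident["acknowledged_at"] < period_end
        # Unresolved incidents have no resolution time yet, so skip them.
        and incident.get("resolved_at") is not None
    ]
    if not durations:
        return None
    # Step 3: sum the individual resolution times and divide by the count.
    return sum(durations, timedelta()) / len(durations)


# Made-up sample data: three incidents in September 2024, one in August.
incidents = [
    {"acknowledged_at": datetime(2024, 9, 2, 10, 0), "resolved_at": datetime(2024, 9, 2, 10, 30)},
    {"acknowledged_at": datetime(2024, 9, 10, 14, 0), "resolved_at": datetime(2024, 9, 10, 15, 0)},
    {"acknowledged_at": datetime(2024, 9, 21, 8, 0), "resolved_at": datetime(2024, 9, 21, 9, 30)},
    # Acknowledged before the period starts, so it is ignored.
    {"acknowledged_at": datetime(2024, 8, 30, 8, 0), "resolved_at": datetime(2024, 8, 30, 12, 0)},
]

mttr = mean_time_to_resolution(incidents, datetime(2024, 9, 1), datetime(2024, 10, 1))
print(mttr)  # 1:00:00
```

The September incidents took 30, 60, and 90 minutes, so the mean is one hour. Counting an incident by its acknowledgment time is one possible convention; your tool may assign incidents to periods differently.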

Your incident management tool should be able to automatically calculate this and other metrics for you.

The shorter your MTTR, the less likely your systems are to experience prolonged downtime or degraded performance. A short MTTR also signals your SRE team’s maturity: responders are trained to react quickly and effectively, with the right incident response tools and processes in place.

Reducing your MTTR is a sign of progress toward greater reliability, but it cannot be trusted as an absolute measure. A single exceptionally long-lasting incident can skew your data, providing an inaccurate picture of how your team manages incidents. Always dig deeper when evaluating your overall reliability.
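
To see how much a single outlier can distort the picture, here is a small sketch with made-up resolution times in minutes. Comparing the mean against the median is one common way to spot this kind of skew; it is offered as an illustration, not as a prescribed method.

```python
from statistics import mean, median

# Hypothetical resolution times in minutes: five routine incidents
# and one exceptionally long outage.
resolution_minutes = [20, 25, 30, 35, 40, 600]

mttr_mean = mean(resolution_minutes)
mttr_median = median(resolution_minutes)
mean_without_outlier = mean(resolution_minutes[:-1])

print(mttr_mean)             # 125
print(mttr_median)           # 32.5
print(mean_without_outlier)  # 30
```

The one 600-minute incident pushes the mean to 125 minutes, even though a typical incident in this sample is resolved in about half an hour. A large gap between mean and median is a hint to look at individual incidents before drawing conclusions.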

Factors That Impact MTTR

Your Mean Time to Resolution (MTTR) is a high-level metric, meaning it can indicate trends but cannot be used to make specific decisions in isolation. MTTR is influenced by many factors, some within your control and others outside it.

Escalation Processes

Once an incident is acknowledged, the responder will assemble a response team to restore systems to normal as soon as possible. However, deciding who should be part of that team is not always straightforward.

Modern infrastructures are distributed and manage hundreds of software components. This means on-call responders face a lot of complexity, even when just figuring out where to start. You’ll need extensive knowledge of your system to fix it, especially when all you have is an error trace and some logs.

Gaining additional context on the incident, such as graphs from Datadog, can help determine which components are impacted. From there, you can check who is on call and familiar with those components.

Streamlining the response team formation process can help reduce MTTR, as this often takes significant time. Tools like Rootly AI can suggest responders who have managed similar incidents in the past, expediting the team formation process.
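
As a rough illustration of the idea of suggesting responders from incident history, the sketch below ranks people by how often they worked on past incidents that touched the impacted components. This is a deliberately simple heuristic with hypothetical names and data. It does not describe how Rootly AI or any other product works internally.

```python
from collections import Counter


def suggest_responders(impacted_components, past_incidents, limit=3):
    """Rank responders by their history with the impacted components.

    `past_incidents` is a list of dicts with hypothetical keys
    "components" and "responders" (both lists of strings).
    """
    impacted = set(impacted_components)
    scores = Counter()
    for incident in past_incidents:
        overlap = impacted & set(incident["components"])
        if overlap:
            # Credit each responder once per impacted component they handled.
            for responder in incident["responders"]:
                scores[responder] += len(overlap)
    return [name for name, _ in scores.most_common(limit)]


# Made-up incident history.
past_incidents = [
    {"components": ["checkout", "payments"], "responders": ["ana", "bo"]},
    {"components": ["payments"], "responders": ["ana"]},
    {"components": ["search"], "responders": ["cy"]},
    {"components": ["checkout"], "responders": ["bo", "dee"]},
]

suggestions = suggest_responders(["payments"], past_incidents)
print(suggestions)  # ['ana', 'bo']
```

In practice you would also filter the result by who is currently on call, since experience with a component only helps if the person is available.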

Root Cause Analysis

Sometimes you can apply temporary fixes to mitigate the impact of an incident, but ultimately, you must figure out what caused it. Understanding what introduced the error or took a system down is the first step toward a resolution.

However, performing a Root Cause Analysis requires experienced SREs who can navigate complex logs and delve into the infrastructure and codebases of other teams. According to Steve McGhee, Reliability Advocate at Google, this is where SREs showcase their most valuable skills: debugging others’ code and building a mental model of the system to fix it.

Companies like Meta are experimenting with AI to reduce the time needed to perform a Root Cause Analysis. Their approach uses AI-assisted filtering to narrow the search space for responders, making it easier to find the root cause.

Communication Across Teams

Communication during an incident is vital but can quickly become problematic as you coordinate a response and keep everyone in the loop. Inefficient collaboration and poor communication across teams can hinder incident resolution, especially in large enterprises.

You must not only coordinate and track who in your response team is doing what, but also keep stakeholders informed, work with legal representatives, collaborate with customer success teams, and manage PR.

Streamlining communication workflows and automating where possible can significantly reduce MTTR. Responders can collaborate more effectively, while stakeholders receive timely updates on what they need to know.

Fragmented Tools and Context Switching

On-call engineers already deal with enough complexity in trying to restore a system. Yet, many SRE teams also manage a fragmented set of tools, forcing them to switch between multiple apps.

For example, an on-call solution like PagerDuty only alerts you to an issue and leaves you to figure out what to do next. Modern solutions like Rootly consolidate the entire incident management process, from alert to retrospective, so responders can focus on resolving the incident rather than managing the process itself.

How to Drastically Reduce MTTR: Best Practices for SREs

Create a Comprehensive Incident-Management Action Plan

Most teams begin their reliability journey by handling each incident ad hoc. Over time, you’ll notice patterns that help resolve incidents faster and more effectively. As your services and team scale, you’ll need repeatable processes to handle incidents. Crafting a comprehensive incident management plan is essential for building a mature reliability practice.

Start by basing your incident management plan on templates found in SRE books and resources. However, ensure your plan is tailored to your team’s experience, the incidents you typically face, and your system’s architecture.
Define the roles within your incident response team, so each member is empowered to excel without overlapping responsibilities.
Include role-specific steps for each incident, especially those related to security and compliance.

Improve Incident Communication

As your team and services grow, so does the complexity of communication. Incidents, with their urgency and ambiguity, exacerbate this complexity.

Avoid introducing unnecessary communication channels. Stick to the tools your organization already uses, such as Slack or Microsoft Teams, for managing incidents.
Use an incident management tool like Rootly to keep communications organized within your preferred collaboration platform. Rootly’s bots connect natively with Slack and Microsoft Teams, allowing your team to focus on resolving incidents while automating the reporting process.
Keep an incident response communications playbook on hand to ensure stakeholders, partners, and customers receive the relevant information promptly and clearly.

Leverage Automation

While every incident is unique, there are common workflows and processes involved in each one. Tools like Rootly make it easy to set up no-code integrations with over 70 other tools, reducing the burden on responders.

Keep systems in sync with automation. For example, let your incident manager automatically sync the status of a Jira ticket with your incident Slack channel.
Bring in data from various sources for incident evaluation. Rootly can automatically pull relevant Datadog dashboards or notify leadership when specific incident criteria are met.
Use AI to shorten feedback loops. Rootly AI can answer questions about ongoing incidents and generate context-based summaries, saving valuable time for SREs.
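
To make "notify leadership when specific incident criteria are met" concrete, here is a minimal sketch of such a rule written as plain Python. Everything in it is an assumption for illustration: the Incident fields, the "SEV1" label, and the 30-minute threshold are hypothetical, and in a no-code tool you would express the same condition through its workflow settings rather than in code.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    severity: str          # hypothetical labels such as "SEV1", "SEV2", "SEV3"
    customer_facing: bool
    minutes_open: int


def should_notify_leadership(incident, max_minutes_open=30):
    """Return True when the incident matches the escalation criteria."""
    # Criterion 1: the most severe incidents always escalate.
    if incident.severity == "SEV1":
        return True
    # Criterion 2: customer-facing incidents escalate once they run long.
    return incident.customer_facing and incident.minutes_open > max_minutes_open


print(should_notify_leadership(Incident("SEV1", False, 5)))   # True
print(should_notify_leadership(Incident("SEV2", True, 45)))   # True
print(should_notify_leadership(Incident("SEV3", False, 120))) # False
```

Writing the criteria down explicitly, whatever the format, means responders do not have to decide mid-incident whether an escalation is warranted.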

Track and Analyze Data

To reduce MTTR, you need to monitor its evolution and evaluate other key performance indicators related to reliability.

Focus on metrics that are meaningful to your business. Avoid vanity metrics that are easy to measure but don’t provide real value.
Review metrics systematically at regular intervals. High-level metrics only yield useful insights when evaluated periodically.
Understand the limitations of metrics, including MTTR. Seasonal factors or extraordinary incidents can skew results, leading you down unproductive paths.
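
One way to monitor how MTTR evolves is to compute it per calendar month and compare the values at each review. The sketch below assumes incidents are available as (acknowledged_at, resolved_at) pairs. The function name and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def mttr_minutes_by_month(incidents):
    """Group incidents by acknowledgment month and return the MTTR in minutes for each."""
    buckets = defaultdict(list)
    for acknowledged_at, resolved_at in incidents:
        minutes = (resolved_at - acknowledged_at).total_seconds() / 60
        buckets[acknowledged_at.strftime("%Y-%m")].append(minutes)
    # Sort by month so the trend reads chronologically.
    return {month: sum(values) / len(values) for month, values in sorted(buckets.items())}


# Made-up sample data.
incidents = [
    (datetime(2024, 7, 3, 9, 0), datetime(2024, 7, 3, 10, 30)),     # 90 minutes
    (datetime(2024, 7, 19, 22, 0), datetime(2024, 7, 19, 22, 30)),  # 30 minutes
    (datetime(2024, 8, 8, 13, 0), datetime(2024, 8, 8, 13, 45)),    # 45 minutes
]

trend = mttr_minutes_by_month(incidents)
print(trend)  # {'2024-07': 60.0, '2024-08': 45.0}
```

With only a handful of incidents per month, a single long outage can dominate a bucket, so read month-over-month changes alongside the incident count.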

Leverage the Right Tools

Simplifying the work of your responders is crucial for reducing MTTR. Remove the friction caused by suboptimal tools or fragmented workflows.

Use modern incident management and communication tools to streamline the entire process, from detection to resolution.
Ensure your incident management tool integrates with all the platforms your team relies on.

How Rootly Can Help You Reduce MTTR

Rootly is an on-call and incident manager trusted by leading reliability teams like LinkedIn, Cisco, NVIDIA, and Webflow. Rootly offers incident management bots for Slack and Microsoft Teams, allowing responders to manage incidents directly from these platforms. Our solution tracks your MTTR and other incident metrics, which you can analyze through detailed dashboards.

Book a demo with one of our reliability experts to see how Rootly can help your team reduce its MTTR.