September 17, 2024
8 mins
Prolonged downtime can have costly consequences for your organization. By reducing your Mean Time to Resolution (MTTR), you limit potential revenue loss and reputational damage. Learn the best practices used by top SRE teams, from communication and automation to tracking the right data.
Reliability is a long-term journey. eBay’s first notable incident occurred in 1999, when its platform was fully unavailable for 22 hours. The company lost $3.29 million (equivalent to $6.10 million in 2024), and its stock price tumbled. Since then, the company has come a long way. eBay’s latest significant incident was only partial and was resolved within an hour. Decades of substantial technology investments and continuous reliability improvements have made eBay a model in the industry, boasting 99.99% availability even during traffic spikes.
Reducing your Mean Time to Resolution (MTTR) has direct business consequences. Whether through lost revenue, compensation owed for SLA breaches, or damage to brand reputation, incidents can hurt any organization. That’s why teams invest in technology, staff, and processes that enable them to recover as quickly as possible after an incident occurs.
In this blog post, you’ll get an overview of what MTTR is and the factors that can affect its performance. You’ll also learn best practices and proven strategies that have worked for more than a hundred SRE teams.
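Mean Time to Resolution (MTTR) is a key reliability metric that focuses on the time it takes your team to resolve an incident. To calculate it, divide the total time spent resolving incidents during a given period by the number of incidents resolved in that period.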
Your incident management tool should be able to automatically calculate this and other metrics for you.
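If you want to sanity-check the number your tool reports, here is a minimal sketch of the calculation. It assumes MTTR is the total resolution time divided by the number of resolved incidents, and that each incident is a (started, resolved) pair of timestamps. The function name and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average the time between start and resolution across incidents."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    # Sum the individual resolution times, starting from a zero timedelta.
    total = sum((resolved - started for started, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents: (started_at, resolved_at)
incidents = [
    (datetime(2024, 9, 1, 10, 0), datetime(2024, 9, 1, 10, 30)),  # 30 minutes
    (datetime(2024, 9, 5, 14, 0), datetime(2024, 9, 5, 15, 30)),  # 90 minutes
    (datetime(2024, 9, 9, 8, 0), datetime(2024, 9, 9, 9, 0)),     # 60 minutes
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```

Teams differ on when the clock starts (first alert, acknowledgment, or customer impact), so pick one definition and apply it consistently.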
The shorter your MTTR, the less time your systems spend down or degraded. A short MTTR also reflects your SRE team’s maturity: responders are trained to react quickly and effectively, with the right incident response tools and processes in place.
Reducing your MTTR is a sign of progress toward greater reliability, but it cannot be trusted as an absolute measure. A single exceptionally long-lasting incident can skew your data, providing an inaccurate picture of how your team manages incidents. Always dig deeper when evaluating your overall reliability.
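To illustrate how one outlier distorts the picture, the sketch below compares the mean with the median for a hypothetical set of resolution times. The numbers are made up, and reporting the median alongside MTTR is a suggestion here, not something the metric itself requires.

```python
from statistics import mean, median

# Hypothetical resolution times in minutes; the last incident is an outlier.
durations_min = [20, 25, 30, 35, 40, 600]

mttr_mean = mean(durations_min)      # pulled up by the 600-minute incident
mttr_median = median(durations_min)  # closer to the typical incident

print(mttr_mean)    # 125
print(mttr_median)  # 32.5
```

Five of the six incidents were resolved in 40 minutes or less, yet the mean suggests a typical incident takes over two hours.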
Your Mean Time to Resolution (MTTR) is a high-level metric, meaning it can indicate trends but cannot be used to make specific decisions in isolation. MTTR is influenced by many factors, some within your control and others outside it.
Once an incident is acknowledged, the responder will assemble a response team to restore systems to normal as soon as possible. However, deciding who should be part of that team is not always straightforward.
Modern infrastructures are distributed and manage hundreds of software components. This means on-call responders face a lot of complexity, even when just figuring out where to start. You’ll need extensive knowledge of your system to fix it, especially when all you have is an error trace and some logs.
Gaining additional context on the incident, such as graphs from Datadog, can help determine which components are impacted. From there, you can check who is on call and familiar with those components.
Forming the response team often takes significant time, so streamlining this step can help reduce MTTR. Tools like Rootly AI can suggest responders who have managed similar incidents in the past, which speeds up team formation.
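The idea of suggesting responders from past incidents can be sketched with a simple heuristic: rank on-call engineers by how often they worked on incidents touching the affected components. This is an illustrative assumption, not how Rootly AI works, and the function name, record layout, and names below are hypothetical.

```python
from collections import Counter


def suggest_responders(past_incidents, affected_components, on_call, limit=3):
    """Rank on-call engineers by past incidents on the affected components."""
    affected = set(affected_components)
    scores = Counter()
    for incident in past_incidents:
        overlap = affected & set(incident["components"])
        if not overlap:
            continue  # incident did not touch any affected component
        for responder in incident["responders"]:
            if responder in on_call:
                scores[responder] += len(overlap)
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:limit]]


# Hypothetical incident history
past_incidents = [
    {"components": ["checkout", "payments"], "responders": ["ana", "bo"]},
    {"components": ["payments"], "responders": ["ana"]},
    {"components": ["search"], "responders": ["cy"]},
    {"components": ["checkout"], "responders": ["dee", "bo"]},
]

suggested = suggest_responders(
    past_incidents,
    affected_components=["payments", "checkout"],
    on_call={"ana", "bo", "cy"},
)
print(suggested)  # ['ana', 'bo']
```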
Sometimes you can apply temporary fixes to mitigate the impact of an incident, but ultimately, you must figure out what caused it. Understanding what introduced the error or took a system down is the first step toward a resolution.
However, performing a Root Cause Analysis requires experienced SREs who can navigate complex logs and delve into the infrastructure and codebases of other teams. According to Steve McGhee, Reliability Advocate at Google, this is where SREs showcase their most valuable skills: debugging others’ code and building a mental model of the system to fix it.
Companies like Meta are experimenting with AI to reduce the time it takes to perform a Root Cause Analysis. Their approach uses AI-assisted filtering to narrow the search space for responders, making it easier to find the root cause.
Communication during an incident is vital but can quickly become problematic as you coordinate a response and keep everyone in the loop. Inefficient collaboration and poor communication across teams can hinder incident resolution, especially in large enterprises.
You must not only coordinate and track who in your response team is doing what, but also keep stakeholders informed, work with legal representatives, collaborate with customer success teams, and manage PR.
Streamlining communication workflows and automating where possible can significantly reduce MTTR. Responders can collaborate more effectively, while stakeholders receive timely updates on what they need to know.
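One way to picture this kind of automation is a small function that turns a single incident record into audience-specific updates, so responders write the facts once. The sketch below is an assumption for illustration: the `Incident` fields, the severity labels, and the rule that only SEV1 and SEV2 incidents notify people outside the response team are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    severity: str  # assumed labels: "SEV1" (highest) to "SEV3"
    status: str
    customer_impact: str


def build_updates(incident):
    """Return one message per audience, built from a single incident record."""
    updates = {
        "responders": f"[{incident.severity}] {incident.title}: {incident.status}",
    }
    # Assumed policy: only high-severity incidents go beyond the response team.
    if incident.severity in {"SEV1", "SEV2"}:
        updates["stakeholders"] = (
            f"{incident.title} ({incident.severity}). "
            f"Customer impact: {incident.customer_impact}. Status: {incident.status}."
        )
        updates["customer_success"] = (
            f"Customer impact: {incident.customer_impact}. Status: {incident.status}."
        )
    return updates


major = Incident("Checkout errors", "SEV1", "investigating", "some payments fail")
minor = Incident("Slow admin page", "SEV3", "mitigated", "none")

for audience, message in build_updates(major).items():
    print(f"{audience}: {message}")
```

In practice, each message would be posted to the matching channel or status page by your incident management tool.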
On-call engineers already deal with enough complexity in trying to restore a system. Yet, many SRE teams also manage a fragmented set of tools, forcing them to switch between multiple apps.
For example, an on-call solution like PagerDuty alerts you to an issue but leaves you to figure out what to do next. Modern solutions like Rootly consolidate the entire incident management process, from alert to retrospective, so responders can focus on resolving the incident rather than managing the process itself.
The first step in your reliability journey is addressing each incident with an ad-hoc approach. Over time, you’ll notice patterns that help resolve incidents faster and more effectively. As your services and team scale, you’ll need repeatable processes to handle incidents. Crafting a comprehensive incident management plan is essential for building a mature reliability practice.
As your team and services grow, so does the complexity of communication. Incidents, with their urgency and ambiguity, exacerbate this complexity.
While every incident is unique, there are common workflows and processes involved in each one. Tools like Rootly make it easy to set up no-code integrations with over 70 other tools, reducing the burden on responders.
To reduce MTTR, you need to monitor its evolution and evaluate other key performance indicators related to reliability.
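As a rough sketch of what monitoring the evolution of MTTR can look like, the snippet below groups hypothetical incidents by month and reports the incident count next to the average resolution time. The data layout and function name are assumptions, and your incident management tool will usually provide this view directly.

```python
from collections import defaultdict
from statistics import mean


def mttr_by_month(incidents):
    """Group (month, minutes_to_resolve) pairs and summarize each month."""
    buckets = defaultdict(list)
    for month, minutes in incidents:
        buckets[month].append(minutes)
    # Sort by month so the trend reads chronologically.
    return {
        month: {"incidents": len(values), "mttr_min": mean(values)}
        for month, values in sorted(buckets.items())
    }


# Hypothetical data: (month resolved, minutes to resolve)
incidents = [
    ("2024-06", 90), ("2024-06", 70),
    ("2024-07", 60), ("2024-07", 40),
    ("2024-08", 45),
]

trend = mttr_by_month(incidents)
for month, stats in trend.items():
    print(month, stats)
```

Reading the count next to the average matters: a month with a single incident can swing MTTR without saying much about the team.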
Simplifying the work of your responders is crucial for reducing MTTR. Remove the friction caused by suboptimal tools or fragmented workflows.
Rootly is an on-call and incident management platform trusted by leading reliability teams at LinkedIn, Cisco, NVIDIA, and Webflow. Rootly offers incident management bots for Slack and Microsoft Teams, allowing responders to manage incidents directly from these platforms. Our solution tracks your MTTR and other incident metrics, which you can analyze through detailed dashboards.
Book a demo with one of our reliability experts to see how Rootly can help your team reduce its MTTR.