

Leveraging Tools and Technology to Strengthen Incident Management
Continuing the journey towards a mature SRE practice.
Chris Inch is an engineering leader with 20+ years of experience leading teams at scale at companies like Shopify and Wealthsimple (a Rootly customer).
This is part 2 of a 3-part series on maturing incident management practices. In the first post, I discussed the hidden costs of immature incident management. In this article, I will get into some of the ways that you can leverage tools and technology to strengthen your incident management.
Before we get into some of the nitty-gritty details, it’s important to realize that there is no cure-all solution that will work across all organizations right away. Change is difficult. It takes time to realize the gains of proactive efforts. You have to keep walking in the direction of improvement and lay one brick at a time. Over time, you will have fewer incidents, faster response and resolution times, and a team that is less burdened by recurring issues.
The first (and often most effective) spot to start adding tooling is any place where it is someone’s job to watch a dashboard and sound the alarms when something looks wrong. This is what monitors and alerts do extremely well, and any observability platform worth its salt will have great monitoring capabilities right out of the box.
At both Shopify and Wealthsimple, my team and I used Datadog as our primary observability tool, with thousands of monitors in place to alert us whenever a problem needed attention. Add alerts for anything that provides a clear signal of a problem and that a responder can act on; skip them when a monitor is noisy or there is nothing to be done about it. You want clear, actionable alerting, or you’ll risk alert fatigue.
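As a rough illustration, here is what a clear, actionable monitor might look like when created with Datadog’s Python client (datadogpy). The metric name, threshold, runbook URL, and Slack handle are placeholders rather than anything from my actual setup; the point is that the alert fires on an unambiguous signal and tells the responder exactly where to go next.

```python
# Sketch only: metric name, threshold, runbook URL, and Slack handle are placeholders.
from datadog import initialize, api  # pip install datadog

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Monitor.create(
    type="metric alert",
    # Fires on a clear, user-facing signal: checkout latency breaching its target.
    query="avg(last_10m):avg:myapp.checkout.latency.p95{env:prod} > 2",
    name="[prod] Checkout p95 latency above 2s",
    message=(
        "Checkout latency has been above its target for 10 minutes.\n"
        "Runbook: https://wiki.example.com/runbooks/checkout-latency\n"
        "@slack-checkout-oncall"
    ),
    tags=["team:checkout", "service:checkout"],
    options={"notify_no_data": False, "renotify_interval": 30},
)
```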
Another area to look for potential improvements is anywhere heroes are using bespoke tools or solutions when investigating issues. A “one size fits one” solution to observability is not a great approach. If you have to log into a production console to understand what is happening during an incident, then it is time to have a proper logging and tracing suite in place across your entire stack. Doing so will allow your entire team to have a consistent way to diagnose issues and find solutions to problems together.
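If every team is currently rolling its own diagnostics, OpenTelemetry is one common way to get consistent tracing across every service in the stack. Below is a minimal Python sketch; the service name, span name, and console exporter are stand-ins for whatever your own services and observability backend actually use.

```python
# Minimal OpenTelemetry tracing sketch (pip install opentelemetry-sdk).
# Service name, span names, and the console exporter are illustrative stand-ins.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "payments-api"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("payments-api")

def charge_card(order_id: str) -> None:
    # Every service instruments the same way, so any responder can follow a
    # request across the stack instead of reaching for a production console.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        # ... call the payment provider here

charge_card("ord_123")
```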
When adding monitors, remember to monitor the “symptoms” instead of the “suspected causes” of problems. Black-box monitoring tends to be more reliable and is better at catching issues of all sorts. In other words, it’s better to alert when your system is returning 5xx errors to users, a problem people are actually feeling, than to alert on high CPU on your database servers, which may or may not ever affect your end users.
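To make that concrete, here is the same idea expressed as two Datadog-style monitor queries (the metric names are placeholders): the first pages on a symptom users can feel, the second on a suspected cause that may never reach them.

```python
# Placeholder metric names; the contrast is what matters, not the exact syntax.

# Symptom: users are actually receiving errors. Always worth paging on.
SYMPTOM_QUERY = "sum(last_5m):sum:myapp.http.5xx{env:prod}.as_count() > 100"

# Suspected cause: the database is busy. It might matter, or it might not.
# Better suited to a dashboard or a low-priority notification than a page.
CAUSE_QUERY = "avg(last_5m):avg:system.cpu.user{env:prod,role:db} > 90"
```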
A very common question development teams face is whether to keep performing a task manually or to automate it. Automation can be time-consuming to build and can require ongoing maintenance that isn’t always worth the effort. In incident response, though, automation can mean the difference between downtime for your users and a top-notch, rock-solid experience for everyone, not to mention a good night’s sleep for your on-call responders.
The first step to adding meaningful automation is to start manually and create runbooks for your team to follow during a disruption in order to restore functionality. These runbooks are simple, step-by-step instructions that anyone on the team can follow when responding to an incident. If you already have these documented, then you’ve done most of the heavy lifting already.
As long as your runbooks have clear execution steps and rule-based criteria for initiation, they are worth automating. Automations can execute nearly instantly and help a system recover gracefully with zero or near-zero downtime.
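As a sketch of what that can look like, here is a hypothetical runbook step turned into a small script. The health-check endpoint, threshold, and remediation command are all made up; the shape (check the rule-based criteria, then execute the documented steps) is what carries over.

```python
# Hypothetical runbook automation: endpoint, threshold, and remediation are placeholders.
import subprocess
import urllib.request


def queue_depth() -> int:
    """Runbook step 1: check the background job queue depth."""
    with urllib.request.urlopen("http://jobs.internal/metrics/queue_depth") as resp:
        return int(resp.read())


def remediate() -> None:
    """Runbook step 2: if the queue is backed up, scale out the workers."""
    subprocess.run(
        ["kubectl", "scale", "deployment/worker", "--replicas=10"],
        check=True,
    )


if __name__ == "__main__":
    # The rule-based criteria for initiation, exactly as written in the runbook.
    if queue_depth() > 5000:
        remediate()
```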
Where you add automation depends on the tools your team already uses internally.
One example of automation that we used extensively at Wealthsimple was canary deployments with Argo Rollouts. This deployment strategy promotes new code changes to a small fraction of replicas (e.g., 10%) before rolling them out entirely. While the canary runs, its behaviour is compared against the stable replicas to catch any degradation for end users, typically higher latency or an increase in errors. If either metric worsens, the canary is rolled back automatically and teams are alerted that the deployment was unsuccessful. If everything stays healthy for at least 10 minutes, the remaining replicas are replaced with the new code and the deployment succeeds. This is a very effective way to limit the blast radius of incidents and automate the resolution of would-be problems.
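The real analysis rules live in the Argo Rollouts configuration, but the decision it automates is simple enough to sketch. The thresholds below are illustrative, not the ones we used.

```python
# Not Argo Rollouts itself, just a sketch of the promote-or-rollback decision it
# automates. Thresholds are illustrative.
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float      # fraction of requests returning errors
    p95_latency_ms: float  # 95th percentile latency


def should_promote(stable: Metrics, canary: Metrics) -> bool:
    """Promote the canary only if it is no worse than the stable replicas."""
    if canary.error_rate > stable.error_rate * 1.5:
        return False  # more errors than the stable version: roll back and alert
    if canary.p95_latency_ms > stable.p95_latency_ms * 1.2:
        return False  # noticeably slower for users: roll back and alert
    return True        # healthy for the whole observation window: go to 100%


print(should_promote(Metrics(0.001, 120), Metrics(0.0012, 125)))  # True
```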
It wasn’t too long ago that a significant portion of my own workweek was spent in postmortem meetings, assessing contributing factors, looking for recurring incidents, and creating executive summaries after an incident occurred. All of these things are now areas where AI can help.
Just about every tool available today offers some sort of AI functionality. You should be embracing these offerings as much as possible, especially if they help you diagnose and resolve issues quickly for your users.
For example, Rootly is an incident management platform that I’ve used extensively. It has a very useful AI feature that summarizes incidents while they are in progress. This is useful for anyone joining an ongoing incident who needs to catch up. Rootly quickly and accurately summarizes the incident and updates the new responder without distracting from the work in progress. It’s a very effective way to add superpowers to your incident response.
AI is changing software development and incident response as we speak. New solutions are emerging every week to assist with many parts of the traditional incident responder job. This is a great thing, and AI will most definitely make your team more consistent and empower responders to restore functionality more quickly during an incident. And really, that’s the main goal.
AI is really good at assessing logs, traffic, and other data to quickly spot deviations from everyday levels. It is also helping teams group related alerts together so that only the most meaningful, high-signal alerts fire, which greatly reduces the alert fatigue responders experience.
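Real AI-driven tooling is far more sophisticated than this, but a toy z-score check over request rates shows the basic idea: flag deviations from everyday baselines instead of paging on every blip.

```python
# Toy anomaly check: the request counts are made up, and real tools do far more.
from statistics import mean, stdev

baseline = [1180, 1205, 1190, 1220, 1198, 1210, 1187, 1225]  # normal minutes
current = 2950                                               # this minute

mu, sigma = mean(baseline), stdev(baseline)
z = (current - mu) / sigma

if abs(z) > 4:  # far outside everyday levels
    print(f"anomaly: {current} req/min (z-score {z:.1f}), worth a high-signal alert")
```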
Leveraging technology to help your team is something organizations of any size can benefit from. Even small start-ups can begin adding tools and automation today. Identify the most impactful areas that require the least amount of effort and build from there. Whether your engineering department is 6 people or 600, there are always ways to make your offerings more reliable through the use of tools and automation—not to mention less of a burden for your team.
Computers are really good at following instructions, and they don’t mind being asked to help at 3 AM. The use of good observability tools, monitoring, alerting, AI, and automation can drastically reduce the time, money, and reputational damage caused by incidents.
In part 3 of this series, we’ll address ways to make your incident response more sustainable for the humans behind the computers.