

*Chris Inch is an engineering leader with 20+ years of experience leading teams at scale at companies like Shopify and Wealthsimple (a Rootly customer).*
This is part 3 of a 3-part series on maturing incident management practices.
In previous articles, I shared the cost of immature incident management and how to improve it by investing in tech proactively. In this article, I will pay closer attention to the responders behind the scenes: the actual humans doing the hard work, and how to make sure you’re not sacrificing great people to keep your systems up and running.
Even with the advancements we’ve made through tooling, automations, and AI, there are still humans behind the scenes to respond to alerts, handle disruptions, and investigate the most complex issues.
Incident management maturity hinges on these humans and, in particular, on three foundations: responder wellness, expertise that is shared beyond a few heroes, and a culture of learning built on trust and safety.
These human elements are the pillars of an effective and sustainable incident management process. Let me share how I’ve cultivated them across teams throughout my career.
If you take only one thing from this article, please remember this: prioritizing human wellness in your incident management process will lead to faster incident resolution, stronger teams, and lower turnover of skilled people.
Let’s face it: Responding to emergencies is a draining job.
Even being “on-call” without ever receiving a page is a burden for most people. So the more you can do to consider and improve wellness among your incident response team, the better.
When I first joined Wealthsimple, the rotation of Incident Commanders was quite small. The Incident Commanders were a group of people who were responsible for responding to every incident that occurred, helping to coordinate subject matter experts and communicate to stakeholders. Because the group was so small, most people on the rotation were tired and looking for a break. It was draining Incident Commanders and leading to burnout for individuals within the group. Even worse: when someone wanted to leave the rotation, it meant even more burden for the remaining people.
To make this situation better, we took action in a few ways:
First and foremost, we trained dozens of people of varying skills to respond to incidents. The people that joined the rotation were not your regular SRE-type people. They were developers, managers, directors, and even some from Product and Customer Support. We trained these individuals to communicate and coordinate incidents effectively and drive them toward resolution.
Increasing participation in the rotation brought immediate relief for longstanding participants. It also created deeper learning of our systems for more people across many teams, shared context across engineering, and helped develop strong communication and leadership skills in the people commanding incidents. Of course, the tradeoff is that more people have to be on-call in general, but with dozens of Incident Commanders available, the burden came to about 24 hours of on-call every couple of months, a much lighter load on any one individual.
The second thing we did was to build in a regular break from being on-call each quarter. At each break, some people were taken off rotation and others were rotated into the cohort of active Incident Commanders. We called these “tours of duty”. This natural cadence gave people full relief from the burden of being on-call, gave us a time to celebrate the folks who had just completed a tour, and allowed people to opt out if they needed a longer break. (We never asked questions when someone needed a longer break.) It also allowed us to train up new Incident Commanders at regular intervals and upgrade the skills of existing commanders as needed.
A note on team on-call and pager rotations: On-call schedules should be equitable rather than strictly equal. Encourage the team to build a schedule that fits each person’s circumstances. If someone is taking a night-school course, or has a new baby at home, it may work better for them to take more daytime or weekend shifts and allow a night owl on the team to cover the overnight on-call rotation. Work this out with your teams to find a solution that allows everyone to be at the top of their game when responding to potential incidents.
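To make the arithmetic behind that tradeoff concrete, here is a minimal sketch. It assumes a single on-call seat, fixed-length shifts, and a strict round-robin; the function names and the rotation sizes of 6 and 60 are illustrative assumptions, not figures from Wealthsimple.

```python
def days_between_shifts(rotation_size: int, shift_hours: float = 24) -> float:
    """Days from the start of one person's shift to the start of their next.

    Assumes one person is on-call at a time and shifts are handed off
    round-robin, so a full cycle lasts rotation_size * shift_hours.
    """
    if rotation_size < 1:
        raise ValueError("rotation_size must be at least 1")
    if shift_hours <= 0:
        raise ValueError("shift_hours must be positive")
    return rotation_size * shift_hours / 24


def shifts_per_quarter(rotation_size: int, shift_hours: float = 24,
                       quarter_days: int = 91) -> float:
    """Average number of shifts each person covers in one quarter."""
    total_shifts = quarter_days * 24 / shift_hours
    return total_shifts / rotation_size


# Hypothetical rotation sizes, chosen only to show the effect of growing the pool.
small_rotation = days_between_shifts(6)    # a shift roughly every week
large_rotation = days_between_shifts(60)   # a shift roughly every two months
```

Under these assumptions the gap between shifts grows linearly with the size of the rotation, which is why adding trained commanders relieves the existing ones so quickly.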
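One way to picture an equitable schedule is a small assignment routine that respects each person's constraints while keeping the load balanced. The sketch below is only an illustration: the names, the "day"/"night" slots, and the greedy least-loaded rule are all assumptions, and a real team would still agree on the result together.

```python
from collections import Counter


def assign_shifts(shifts, unavailable):
    """Assign each shift to the least-loaded person who can cover it.

    shifts: list of (day, slot) tuples, e.g. ("Mon", "night").
    unavailable: dict mapping each person to the set of slots they cannot cover.
    Returns a dict mapping each shift to the chosen person.
    """
    load = Counter({name: 0 for name in unavailable})
    schedule = {}
    for shift in shifts:
        _, slot = shift
        candidates = [name for name in unavailable if slot not in unavailable[name]]
        if not candidates:
            raise ValueError(f"no one is available for {shift}")
        # Fewest shifts so far wins; ties break alphabetically to stay deterministic.
        chosen = min(candidates, key=lambda name: (load[name], name))
        schedule[shift] = chosen
        load[chosen] += 1
    return schedule


# Hypothetical team: ana has a new baby (no nights), caz is a night owl (no days).
team = {"ana": {"night"}, "ben": set(), "caz": {"day"}}
week = [(day, slot) for day in ("Mon", "Tue") for slot in ("day", "night")]
schedule = assign_shifts(week, team)
```

The greedy rule is deliberately simple. It never gives anyone a slot they have ruled out, and it spreads the remaining shifts as evenly as those constraints allow.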
As discussed in the first article in this series, it’s so easy to turn to the individuals who have the most experience and track record of fixing issues when dealing with bugs, incidents, or disruptions. But heroes—just like the rest of us—are human too, and spoiler alert: they will not be around forever.
While there is often a case to be made to have your star players respond to all incidents, it can also be a giant source of organizational fragility when the hardest stuff always goes to one or two people. A longer-term solution is to build strengths across all individuals.
In addition to distributing some of the on-call responsibilities to more individuals, it’s also worth the time to train more people in some of the areas that your heroes know best. It rarely seems like there is time to proactively train people in complex areas of your systems, but it can be highly valuable for everyone to have a deeper understanding of how things work.
Shadowing is a great way for people who are less familiar with an area to learn from someone who is more familiar. This can be achieved by having shadow participants on on-call rotations. They take on the same duties as the main responder; however, they observe rather than drive during an incident call.
This allows them to soak up all the nuances of incidents and the code base, as well as how to communicate and resolve disruptions. Even better than shadowing is reverse shadowing, which is when the less experienced person gets to actually sit in the driver’s seat and the experienced partner observes an incident from the sidelines, but is around to offer help when needed. This reverse shadowing has the additional benefit of building confidence and expertise in the newbies through hands-on experience.
Sometimes, organizations treat incidents as speed bumps or detours on their journey to release amazing products. These incidents are treated as inconveniences to get past, and once they’re in the rear view mirror, the organization never looks back. On the other hand, a stronger organization will see every incident as an opportunity to learn and become stronger. By interrogating the incident, teams can extract valuable insights to make their software and their systems stronger, more resilient, and more reliable.
The concept of a “blameless retrospective” (or blameless postmortem) has existed for quite some time and is used by many companies to uncover what happened during an incident without blaming people, centering conversations instead on the systems that allowed the incident to happen. However, it is the trust, the openness, and the safety that are the real ingredients of a successful retrospective.
For example: A bad retrospective can still create fear, break trust, and fail to effectively communicate even if you don’t name names and call it “blameless”. On the other hand, an excellent postmortem can create safety for individuals, trust on the team, and complete openness to share and learn together, and this might be done while talking directly with those who were involved in the incident. The huge difference here is the prework that you’ve done to create a culture of safety and trust ahead of time.
Over this series of blog articles, I have been using the word “maturity” to discuss the evolution of incident management. This is very intentional, to highlight that your approach should match your company’s stage of growth. It is a mistake to make significant changes to process and tooling while you are still trying to find product-market fit. But it’s also a mistake to neglect your incident management until it’s too late, everything’s on fire, and your people are burnt out.
In the first article, we looked into common pain points of incident management and the costs that it may be having on your technology and your team. In the second article, I discussed ways to leverage technology to give superpowers to your teams, and in this third and final article, I presented ways to ensure you are prioritizing the humans that actually keep everything running smoothly.
By using these suggestions as jumping-off points, you will find what works for you and your team. Taking deliberate action to make improvements will reduce incident frequency and severity, lead to faster resolution times, and improve perceived reliability. It can also simultaneously reduce responder fatigue and exhaustion. Lastly, if you approach every incident as a way to learn and become stronger as an organization, you can enhance the collective knowledge of your team and ultimately create a more resilient organization and more reliable software.