How to Build a Successful On-Call Culture in 2024: Tips, Best Practices
Good culture can make your team 30% more productive. But how do you cultivate it in the on-call context? Read what SRE leaders are recommending.
August 22, 2023
Hans Chung refers to the tendency for SREs to independently zoom in on one task or problem at a time, and the consequences that come with it, as the “solo hero pattern”. In this post, he explores some of the reasons it happens, and what SRE leaders can do about it.
Let’s be honest. When you see an alert pop up on your phone, you aren’t thinking, “According to section 12 of our most recent SRE handbook, used at training 6 months ago, I need to keep in mind who should be Incident Commander and who should be Ops Lead.” You’re an engineer at heart. You think, “Is this a false alarm like the last 5 times?”, “Is it functional area Y or dependency Z again?”, “How do I hand off or save the state of the other thing I’m working on right now?” In this post, I’m going to refer to this tendency to independently zoom in on one task or problem at a time, and the consequences that come with it, as the “solo hero pattern”. I’ll explore some of the reasons it happens, and what we, as SRE leaders, can do about it.
While we preach the 3 C’s of SRE, one of the biggest milestones in an incident is getting the right people in the room quickly. When that alert comes in, you probably don’t know what that list of folks should be. You need to dig in and validate that there is an incident. Your mind goes into diagnosis mode. The docs you look up are related to the diagnosis at hand. Your cognitive capacity is starting to get monopolized. You pull up logs and charts. Now you are tunnel-vision focused on how big the blast radius is and you are keeping an active mental note of hypotheses of root cause or mitigation options. If it’s a major incident, that’s clear. If not, then you think “Hey, I can stop the bleeding now, I just need a minute to do X, then I’ll update the incident since it isn’t major”. When that quick fix doesn’t work, that starts another rabbit hole of intellectually stimulating problem solving. You get into a flow state. Time doesn’t move in the same way.
While you are in that awesome flow state, you max out your cognitive capacity. You aren’t thinking about and recognizing which SRE policy applies right now and the full details of that policy. You might not even know the latest policy because you had food poisoning on training day. A simple policy could have been “If you haven’t validated blast radius in 20 minutes, then you need more hands on deck from rotation X”. But you lost track of time…it doesn’t feel like 20 minutes.
This is reality. As SRE leaders we need to acknowledge that what looks like a solo hero pattern exists and that it is rooted in an SRE flow state. So it isn’t reasonable to expect purely human-powered incident response to be consistent with policy.
You could brute-force it by staffing up so there is always an independent Incident Commander, but show me an SRE team that isn’t understaffed. There isn’t a perfect solution for this, but there are dimensions we can work with: tooling, training and staffing.
As a product manager for a legacy-based incident response tool (note: more on implications of tool origins another time), I knew that SREs considered interactions with it to be reporting overhead with very little help in actually mitigating or resolving an incident. SREs should not feel that they work for their incident response tool.
Tooling first needs to help the solo SRE by eliminating the need to recall policy and perform mundane tasks. Why should an SRE be tracking elapsed time? Why should an SRE have to look up an on-call rotation to figure out who to page? Why should an SRE have to look up and read through a playbook for all the aliases to notify and to craft a message? These should all be automated to free up SREs to do the diagnosis, mitigation and resolution. In an ideal world they should not have to proactively provide updates that repeat or summarize what they’ve discussed or what they’ve done. As more and more machine learning (ML) is implemented in incident tooling, I expect we’ll see this type of task being performed automatically.
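To make this concrete, here is a minimal sketch of what that kind of automation might look like. It is not taken from any real incident tool. It assumes the example policy from earlier (escalate if the blast radius isn't validated within 20 minutes), and the rotation name, roster, weekly hand-off rule and message wording are all hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: mirrors the example 20-minute policy mentioned in this post.
BLAST_RADIUS_DEADLINE = timedelta(minutes=20)

# Hypothetical roster; a real tool would query its on-call scheduler instead.
ROTATIONS = {"rotation-x": ["alice", "bob"]}


@dataclass
class Incident:
    title: str
    opened_at: datetime
    blast_radius_validated: bool = False


def current_on_call(rotation: str, now: datetime) -> str:
    """Pick the responder by ISO week number (illustrative weekly hand-off)."""
    members = ROTATIONS[rotation]
    week = now.isocalendar()[1]
    return members[week % len(members)]


def escalation_message(
    incident: Incident, now: datetime, rotation: str = "rotation-x"
) -> Optional[str]:
    """Return a ready-to-send page if the policy deadline has passed, else None."""
    elapsed = now - incident.opened_at
    if incident.blast_radius_validated or elapsed < BLAST_RADIUS_DEADLINE:
        return None
    minutes = int(elapsed.total_seconds() // 60)
    responder = current_on_call(rotation, now)
    return (
        f"Paging {responder} ({rotation}): '{incident.title}' has been open "
        f"{minutes} min without a validated blast radius."
    )


if __name__ == "__main__":
    opened = datetime(2023, 8, 22, 9, 0)
    incident = Incident("Checkout latency", opened)
    # Five minutes in: nothing to do yet, so no message is produced.
    print(escalation_message(incident, opened + timedelta(minutes=5)))
    # Twenty-five minutes in: the tool, not the SRE, notices the deadline.
    print(escalation_message(incident, opened + timedelta(minutes=25)))
```

The point of the sketch is that the clock, the rotation lookup and the message drafting all live in the tool, so the SRE in a flow state never has to remember the policy for it to be applied.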
I know that ML can be just another dependency. I’ll cover graceful tool degradation in another post, with many a war story to go with it.
One of the best ways to build a natural sense of who needs to be in the room is to walk in the shoes of all those other folks throughout an incident during training. What does it feel like to be a PR person who’s been brought in 4 hours after an incident started? What does it feel like to be an executive caught off guard by a call from a VIP client asking why your product is broken and making them lose hundreds of millions? At the very least, have all involved teams do joint fire drills. Realistic training scenarios also give SRE leadership a glimpse into which policies are hard to recall consistently and in a timely manner. Let’s get realistic about how many policies an SRE can keep fresh in their head and recall when it counts.
We also need to be realistic about staffing. I’ve seen services with bizarre rotation setups, such as two rotations: one for one part of the technology and one for another. The bizarre part is that each rotation’s secondary was the other rotation’s primary. They were not experts in each other’s parts. The ‘outsider’ secondary’s role was to go and find someone on the original team to take an active role in the incident. How bizarre is that? I’ve even seen cross-team setups where the foreign secondary refuses to respond. We can keep inventing these kinds of imaginary solutions, or we can get realistic about the staffing required to achieve an SLO. This goes beyond SRE to the other teams that get pulled into incidents, such as PR and legal. If there is an SLO around external communications (for example, regulatory requirements), then you need to staff those functions with rotations and have response time policies.
Beyond capacity, one of your bottlenecks will be the number of senior SREs that know the service intimately in ways that you can’t expect junior SREs to. Services evolve and relevant design assumptions or behavior limits don’t always get written down.
SRE principles are often unrealistic in high-SLO environments. Relying on training and Slack alone isn’t going to cut it. We have to learn to embrace some default SRE behavior modes and work with or around them. Tooling (such as automation) and training can address a large portion of the problem. A limiting factor in reliability will be staffing. We need to set realistic SLOs given the tools, training and capacity we have to work with. A continually violated SLA or an inconsistent incident experience erodes customer trust. You can keep fooling yourself, but customer dollars usually end up walking away.
About the Author
Hans Chung recently worked as Product Manager for Alphabet’s homegrown incident response and management tools (IRM and Public Status Dashboard). Hans has seen how teams and organizations ranging from Google Cloud and Maps to YouTube and Search think about SRE, and what really happens during incidents. Beyond tooling, Hans was also involved in influencing SRE principles to be more enterprise-aware.