What SREs Can Learn from Capt. Sully: When to Follow Playbooks
Does it always make sense to stick to your playbooks? There’s no clear answer, but it’s still something you should think about.
April 3, 2024
4 min read
This post was contributed by Strong Liang. It has been reposted with his permission from the original article on Medium.
Developing leadership skills is not easy. It’s not taught at school, and often not taught at work either. You’re expected to “just do it.” Managers often call out leadership as an area for improvement in their engineers’ performance reviews without giving them concrete guidance (I’ve certainly been guilty of it).
I think what’s missing is leadership development opportunities. To become battle-tested, you need to go through battles, not just read books or mentor newcomers. Both are helpful, but the stakes are low. On the other hand, high-stakes jobs, such as running a big project or managing a team, are hard to get when you lack experience. So how can we solve this dilemma? Enter incident response.
This realization hit me when I observed senior managers (directors, VPs) stepping into major incidents as incident commanders and outshining engineers dedicated to the commander role. Some of these senior managers had very little knowledge about the tech stack, because they were new to the company. Actually, this didn’t matter much, because they were good at leveraging the team. It turned out that the essence of being a major incident commander is leadership.
What’s cool about this role from the leadership learning perspective is that 1) the stakes are high, which stretches you to grow; 2) you get to lead a team of people that you, as an engineer, normally wouldn’t have a chance to lead; and 3) many companies need more incident commanders, so even if you are not very experienced, you have a chance to become one.
Back when I was an engineer, I thought leadership was little more than a vague word that could mean a wide range of things, and my takeaway was to simply work harder. However, in a crisis situation, what leadership means is much more concrete:
A group of people sitting in the war room doesn’t equate to a response team. Someone needs to get them organized, which involves assessing the team and finding the right people. A commander would say things like: “Looks like team A can help here. Do we have someone from team A? Allison, can you page team A right away?”
This is basically the idea of recruiting—figuring out what skill sets the team lacks and deciding if it makes sense to acquire someone from outside. Regular recruiting is a lot slower and has higher stakes, but you can get practice through “incident recruiting.” You get much quicker feedback on whether you made the right call, and you can learn from mistakes without being stuck with the wrong hires.
Even if the group of people you have is competent, stopping an emergency ASAP is beyond any single individual. A commander must create a division of labor and leverage individual strengths. This might look like:
“Sanjeev, you handle external communication. This is a SEV0 so we need an update sent out every 15 minutes.”
“Sunny, you lead the troubleshooting. Our goal is to mitigate the problem first and then look for the root cause.”
The delegation skill is critical to leading a team in general, but it’s hard to practice because, unless someone is already in a leadership position, they normally don’t get many opportunities to do it. And even when they do, the feedback loop takes weeks, if not months, which further slows down learning. During an incident, however, you are delegating left and right, fixing your delegation errors as you go.
Even in a short, one-hour incident, a commander needs to make numerous decisions to steer the team:
Who do we page and when should we escalate?
Do we go with option X or Y?
What should we tell the public and when?
Is the troubleshooting lead doing a good job? Should they be replaced?
While this is stressful, you gain valuable experience while doing it. Of course, decisions outside incidents usually aren’t nearly as urgent, but a lot of the experience you gain is transferable. You need to have a decision-making process, deal with unknowns, and involve the team without getting paralyzed by indecision.
Incidents give the response team exposure to high-stakes problems and senior leadership. This is especially true for the incident commander. While some team members can work quietly or hide behind a colleague, you as the commander must speak up and communicate clearly with many different stakeholders, for example by keeping execs updated with high-level details while keeping responders moving with clear, granular tasks. The need to zoom in and out across different layers of a problem under stress trains you to stay composed and effective in difficult situations, hone your character skills, and build your credibility. These are all things an organization looks for when the going gets tough, such as when managing high attrition or tight deadlines.
Obviously, incidents aren’t fun. But when they happen to you, keep in mind it’s not just a bad day at work. The skills that you develop when commanding incidents are actually leadership skills. Incidents are growth opportunities. Never let a crisis go to waste.
Special thanks to contributors: Nicolas Gattig, Ashley Sawatsky, Victoria Tovar
Zhuang (Strong) Liang is a software engineering leader with over 16 years of experience, specializing in Reliability and Infrastructure at world-class companies like Affirm, Google, and Uber. You can keep up with his posts on Medium.