September 4, 2024
7 mins
Cultivate a blameless culture and leverage automations to transform failures into learning opportunities. Discover how to implement effective retrospectives and download a free template to get started.
Incident postmortems can make or break companies. For example, CrowdStrike’s postmortem document was anxiously awaited by lawyers looking to sue and by brokers wanting to understand the risk of buying stock. The same happened with UniSuper and Google Cloud after a massive customer-facing incident in Australia.
Granted, your average postmortem (or retrospective) document won’t attract that much press attention. But it’s still fundamental for your team to keep improving reliability and developing their skills as SREs.
Chris Ferraro, now VP of Platform Engineering at Garner Health, recalls the time he took down Microsoft as the worst yet most formative moment in his career. “I've never had that many people in a room who are all concerned about what I was about to say,” explains Chris. The retrospective his team worked through was so good, it “certainly made me the man I am today,” he recalls.
Postmortems are not meant to be a mere bureaucratic requirement or about finding a guilty party. Through retrospective processes, teams make an effort to learn about flaws in their software development lifecycle (SDLC), infrastructure setup, and many other aspects of their organization.
Postmortem, or post-mortem, comes from a Latin term that could be translated as “after death.” It’s commonly used in medicine to refer to a study or action performed on a dead body.
The connotations are a bit too dark for me, plus the word frames incidents as necessarily catastrophic and negative situations. Instead, I always use the word retrospective and recommend it to the SRE teams I coach.
But what is a postmortem or retrospective, anyway?
Is it a big meeting where everyone is frowning? A long and boring document nobody will ever read? Quite the opposite!
Ideally, a retrospective is a process triggered after an incident is resolved or mitigated. It’s a time to reflect intentionally on what went wrong, how it was fixed, and how to prevent it in the future.
There will usually be a meeting or two (but without frowning faces, in a blameless culture), a document (which is actually useful and will be read!), and a set of actions to be taken (showing that each incident actually improves your reliability).
Incident retrospectives may have a lot of benefits, but let’s face it: your response team is busy. They’re either putting out new fires, tackling backlog tickets, or dealing with something in Prometheus. Asking them to write postmortem reports after every incident might not be realistic and can end up causing retrospective fatigue.
During a recent roundtable, reliability experts weighed in on retrospective frequency. Some leaders explained they have guidelines on when an incident requires a postmortem, for example, if certain services were affected or based on the level of severity.
Others find that having the Incident Commander decide whether the incident merits a retrospective is the right path for them.
Each organization has different objectives and challenges. Your team can experiment with different rhythms to find the one that works best. Keep in mind that retrospectives don’t all have to look the same: you can use different templates for different kinds of incidents.
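To make the guidelines above concrete, here is a minimal sketch of a retrospective policy written as code. Everything in it is an assumption for illustration: the severity labels, the list of critical services, and the "full" / "light" / "none" outcomes are hypothetical and should be swapped for whatever your team agrees on.

```python
from dataclasses import dataclass, field

# Hypothetical severity labels and service names; substitute your own.
CRITICAL_SERVICES = {"checkout", "auth"}
SEVERITY_RANK = {"SEV1": 1, "SEV2": 2, "SEV3": 3, "SEV4": 4}


@dataclass
class Incident:
    title: str
    severity: str
    affected_services: set = field(default_factory=set)
    # Lets the Incident Commander ask for a retrospective on a minor incident.
    commander_requests_retro: bool = False


def retrospective_kind(incident: Incident) -> str:
    """Return "full", "light", or "none" for a resolved incident."""
    rank = SEVERITY_RANK[incident.severity]
    touches_critical = bool(incident.affected_services & CRITICAL_SERVICES)
    if rank <= 2 or touches_critical:
        return "full"
    if incident.commander_requests_retro:
        return "light"
    return "none"
```

Writing the rule down like this, even if you never run it, forces the team to agree on the thresholds instead of debating them after every incident.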
Retrospectives are not one-size-fits-all. Each team needs to experiment with the way they conduct postmortems and be ready to adapt their process on a case-by-case basis. However, these are some best practices high-performing SRE teams recommend:
Sorrel Harriet, technology learning consultant and speaker, explains that the high levels of stress inherent to incidents impact our ability to learn from them. Thus, a first step to making incidents a learning opportunity is ensuring people feel safe to report mistakes and to share ideas and feedback.
During a retrospective, your team’s objective is to understand why a failure occurred. By default, people will start pointing fingers at each other, as it’s the “easiest way” to find a root cause: finding a guilty person or team. To combat this, set guidelines for the conversation and make sure the retrospective moderator keeps the discussion away from blaming anyone.
Having a uniform way of running postmortem meetings and documenting findings is good for many reasons. It promotes a blameless thought process, and it helps ensure your team covers all bases. However, you may not always need a dozen people sitting in a room to conduct a retrospective. Some incidents call for a scaled-down retrospective, while others will have you performing deeper investigations.
Instead of having your team jump around outdated Confluence pages, codify your retrospective process in a tool like Rootly. You’ll be able to set up defined retrospective templates with steps that your team can go through easily but with enough flexibility to adapt to each case.
Constructing a timeline for the incident can be a time-consuming task after an incident is resolved. Modern incident managers like Rootly keep track of key events that happened throughout the incident and construct a timeline for you. You can also hook up other tools related to your retrospective, like Jira, to file the actions that come out of the process.
Retrospective processes vary from organization to organization. The number of steps may even depend on compliance requirements, which change with the business line and the incident type. However, these are the basic steps any organization will go through during a retrospective process.
Before you call a bunch of people for a meeting, you have to make sure all the materials you’ll need are available to everyone. You may need to get logs, set up dashboards with metrics, and construct an incident timeline. Gathering any information about the incident context, how it was handled, and the impact will be useful. Share your materials with the team before calling everyone for a retrospective meeting.
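If you are assembling the timeline by hand, the core of the task is merging events from several places into one chronological list. The sketch below shows that idea only; the `Event` fields, the source labels, and the sample events are made up for illustration and do not reflect how any particular tool stores its data.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class Event(NamedTuple):
    at: datetime
    source: str  # illustrative labels such as "alerting", "chat", "deploy"
    note: str


def build_timeline(*sources):
    """Merge events from several sources into one chronological list of lines."""
    events = sorted((e for src in sources for e in src), key=lambda e: e.at)
    if not events:
        return []
    start = events[0].at
    lines = []
    for e in events:
        # Minutes elapsed since the first recorded event.
        minutes = int((e.at - start).total_seconds() // 60)
        lines.append(f"T+{minutes}m [{e.source}] {e.note}")
    return lines


# Made-up events for illustration only.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
alerts = [Event(t0, "alerting", "error-rate alert fired")]
chat = [
    Event(t0 + timedelta(minutes=12), "chat", "rollback started"),
    Event(t0 + timedelta(minutes=5), "chat", "incident declared"),
]
timeline = build_timeline(alerts, chat)
print("\n".join(timeline))
```

Relative offsets ("T+5m") tend to be easier to discuss in a meeting than wall-clock timestamps, but keep the original timestamps around for cross-checking against logs.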
The complexity of an incident can easily steer any discussion away from the original goal. Make sure you establish roles for the retrospective call. Include a moderator, a note-taker, and somebody in charge of logging the actions to be taken as a result of the retrospective. You can go further and assign roles by function, so that specialists bring their input on the infrastructure or legal aspects of the incident.
When you sit down with your team to understand what happened, keep in mind you’re trying to build a blameless culture. I recommend starting by walking through the timeline to get a neutral read on what you know. Go with facts: events that happened, indicators that were captured, actions that were taken, and their outcomes.
Encourage participants to share their perspectives openly, always with respect for their peers. The goal is to get a better understanding of what happened and why. Nobody has the absolute answer; each person is trying to bring a piece of the puzzle.
Ensure the moderator is empowered to reorient the discussion when it drifts, whether into a technical detail that deserves a meeting of its own or into suggestions that what happened was someone else’s fault.
By the time you have this meeting, you likely already know which software component failed and why. Perhaps it was a misconfiguration deployed by mistake or an incompatible update in an indirect dependency. Whatever it was, you want to get to why this issue happened.
A common framework for understanding the root cause of an incident beyond the immediate technical trigger is the 5 Whys technique. The idea is to get deeper into the issue and touch on processes and structures. After describing the root cause you know, ask your team why they think it happened. Then ask again, and again, and again.
You’ll likely uncover unknown assumptions or areas that were left untouched for far too long. The idea is to pin down reasonable changes that can be made so this kind of incident doesn’t happen again in the future.
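One lightweight way to capture a 5 Whys discussion in the retrospective document is to record each answer in order and render the chain as an indented list. The helper below is a hypothetical sketch, and the sample answers are invented to show the shape of a chain that moves from a technical trigger toward process.

```python
def render_five_whys(problem, answers):
    """Render a problem and its chain of 'why' answers as indented text."""
    lines = [f"Problem: {problem}"]
    for depth, answer in enumerate(answers, start=1):
        # Each level is indented further to show it explains the one above.
        lines.append(f"{'  ' * depth}Why #{depth}: {answer}")
    return "\n".join(lines)


# Invented example chain, for illustration only.
chain = render_five_whys(
    "A misconfiguration was deployed to production",
    [
        "The config change skipped review",
        "Config changes follow a different path than code changes",
        "The config pipeline predates the current review policy",
        "Nobody owns the config pipeline",
        "Ownership was not reassigned after a team reorganization",
    ],
)
print(chain)
```

Notice that the later answers are about ownership and process rather than the config file itself, which is where the actionable changes usually are.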
Incident retrospectives should not become a way of patting yourself on the back. You need to drive improvements from postmortems. That may come in many shapes, from scheduling shadow on-call rotations to reinforce your response process to implementing significant changes in your CI/CD pipeline.
No matter what it is, make sure the team leaves the postmortem meeting with action items that can be executed and tracked. If you find out you need a deeper dive into a system, make sure scheduling it is registered as an action item. Don’t leave things hanging in the air: what isn’t tracked won’t get done.
Anonymized data from 150,000 incidents managed through Rootly revealed that around 41% of the follow-up actions that emerged during an incident remain undone after a month. Recording action items during a postmortem process is not enough to improve your reliability.
You’ll need to figure out a way of following up with important tasks and checking on their progress. You’ll often have to help your SRE team reprioritize work so they can focus on the action items that need the most attention.
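As a starting point for that follow-up, here is a minimal sketch that flags action items still open after a cutoff. The `ActionItem` fields and the 30-day default are assumptions for illustration; in practice this data would come from your ticketing system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: str
    created: date
    completed: Optional[date] = None  # None means the item is still open


def stale_items(items, today, max_age=timedelta(days=30)):
    """Return open action items older than max_age, oldest first."""
    stale = [
        item for item in items
        if item.completed is None and today - item.created > max_age
    ]
    return sorted(stale, key=lambda item: item.created)
```

Running a check like this on a weekly schedule and reviewing the output with the team gives you a natural moment to reprioritize, rather than discovering forgotten items during the next incident.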
Retrospectives require coordination and significant manual work. By using a modern incident management tool like Rootly, you can simplify the process for your team while making your retrospectives more consistent.
Rootly also offers smart retrospective templates, dozens of automations, and 70+ integrations with the tools you already use (Confluence, Google Docs, Notion, and more).
LinkedIn, NVIDIA, Figma, Cisco, and Elastic are among the companies using Rootly to manage their retrospectives. Book a demo with one of our reliability experts to see how Rootly can help your team.
You can get started with structured retrospectives without specialized software. Based on our experience working with hundreds of enterprises and startups, we put together a Notion template that you can use for your incidents. Feel free to take it and adapt it to your needs.