Google’s State of DevOps 2021 Report: What SREs Need to Know
The four key takeaways for SREs from Google’s State of DevOps 2021 report
June 30, 2022
Totally preventing all incidents is not only unrealistic. It’s actually undesirable in some respects.
Ask most SREs how many incidents they’d have to respond to in a perfect world, and their answer would probably be “zero.” After all, making software and infrastructure so reliable that incidents never occur is the dream that SREs are theoretically chasing.
Reducing incidents as much as possible is a noble goal. However, it’s important to recognize that incidents aren’t an SRE’s number one enemy. What matters more than the number of incidents you experience is how effectively you respond to each one.
Plus, there’s value in incidents. They are a learning opportunity. If your business never experienced incidents, it would arguably be facing more risk, not less.
We know: These ideas may sound a little counterintuitive. You might even accuse us of being “pro-incident” – which we sort of are. Allow us to explain.
In many respects, incidents are inherently bad. When an incident occurs, it means something broke. That’s bad. It may also mean that users were disrupted, operations were halted or money was lost. Those things are even worse.
On the other hand, incidents aren’t all bad. They actually benefit SRE teams for several reasons, starting with the learning opportunity they provide.
We could go on, but the point is clear: Although incidents cause problems in some respects, they actually create value in others.
The above is not to say that you should welcome incidents with open arms. Obviously, any decent SRE should focus first and foremost on being proactive and preventing incidents from happening whenever possible. They should use chaos engineering to identify problems that could be lurking unseen in production environments. They should leverage infrastructure as code (IaC) to minimize the risk of configuration errors. And so on.
That said, what ultimately matters more than incident frequency is the effectiveness of incident response. It’s better to experience ten incidents that you resolve in under an hour each than one incident that takes mission-critical systems offline for a week.
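To put rough numbers on that comparison, here is a minimal sketch in Python. It assumes each of the ten incidents takes a full hour (the upper bound of "under an hour") and measures availability over a hypothetical 30-day window. The function names and the window length are illustrative assumptions, not part of any standard tooling.

```python
# Illustrative arithmetic only: the durations and the 30-day window are assumptions.
HOURS_PER_WEEK = 7 * 24          # 168 hours
PERIOD_HOURS = 30 * 24           # hypothetical 30-day measurement window


def total_downtime_hours(durations_hours):
    """Sum the downtime, in hours, across a list of incident durations."""
    return sum(durations_hours)


def availability(downtime_hours, period_hours):
    """Fraction of the period during which the system was up."""
    return 1 - downtime_hours / period_hours


# Scenario A: ten incidents, each resolved in (at most) one hour.
many_short = [1.0] * 10
# Scenario B: a single incident that lasts a full week.
one_long = [float(HOURS_PER_WEEK)]

for label, incidents in (("ten short incidents", many_short),
                         ("one week-long incident", one_long)):
    downtime = total_downtime_hours(incidents)
    uptime = availability(downtime, PERIOD_HOURS)
    print(f"{label}: {downtime:.0f} h down, {uptime:.1%} available")
```

Under these assumptions, the ten short incidents add up to 10 hours of downtime, while the single long incident costs 168 hours, even though the incident count is ten times lower.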
So, in addition to investing in tools and processes that mitigate the risk of incidents, SRE teams should place equal emphasis on ensuring that they can react quickly and effectively when an incident happens. This means being able to share information efficiently, defining clear incident response roles, knowing what to prioritize when working through complex incidents and having clear plans in place that spell out how you’ll handle a problem as soon as you detect it. Without these abilities, you’re at risk of letting incidents that should be small turn into major outages.
It’s important to recognize, too, that while it can be fun to imagine a world where zero incidents occur, the reality is that such a world will never exist. If it could, we wouldn’t see each year setting new records for the number of security incidents that businesses collectively suffer, for example.
Nor would we see headlines about major outages at huge enterprises like Facebook or AWS on a recurring basis. If those companies, which have world-class reliability teams and virtually endless resources at their disposal, can’t reduce incidents to zero, neither can anyone else.
The bottom line: There is no such thing as total incident prevention, no matter how hard you try. And even if there were, that wouldn’t actually be a good thing, for the reasons explained above.
So, by all means, undertake reasonable proactive efforts to prevent as many incidents as you can from happening. But don’t let investment in incident prevention cause under-investment in incident response. Being prepared to handle incidents when they happen – which they inevitably will – is what matters most.