July 15, 2021
5 min read
4 best practices for breaking down silos and establishing a culture of shared responsibility toward reliability.
SREs may “own” reliability engineering. But they can succeed in that role only with help from a variety of other stakeholders. If you can’t collaborate and communicate readily with developers, IT engineers and even non-technical teams like PR and legal, you’ll struggle to optimize reliability engineering.
That’s why de-siloing the organization is such a crucial part of managing reliability. Here’s why breaking down the silos that separate SREs from other teams is so important, and practical strategies for doing so.
At first glance, you may not think of organizational silos (meaning divisions between groups or business units that hinder communication and collaboration) as a major challenge for SREs. After all, the SRE role is by nature a hybrid that bridges the gap between development and IT operations, the two main components of a conventional IT organization. SREs are supposed to bring both software engineering and IT Ops skills to the table in order to build as much reliability as possible into the systems they manage.
Yet the fact that the SRE skill set overlaps with that of other disciplines doesn’t automatically eliminate silos between SRE teams and other teams. Those silos have a tendency to persist.
The disconnect between SREs and other technical roles matters, of course, because it hampers the ability of the IT organization as a whole to manage reliability efficiently and effectively. When different parts of the IT organization focus on different pursuits and place different priority levels on reliability engineering, you end up with teams that work toward their own individual interests, rather than optimizing outcomes for the business as a whole.
It’s worth noting that it’s not just silos within the IT organization that make it harder to optimize reliability engineering. Divides between SREs and non-technical business units can be just as problematic.
For instance, SREs don’t typically work alongside or in close collaboration with PR and legal teams. But when an incident occurs, communicating with these teams can be paramount, especially if the incident affects customers in a major way. Legal can help SREs determine what the contractual impact of an incident is, or which service disruptions to prioritize in order to minimize the fallout of SLA violations. Likewise, PR can work with SREs to formulate statements about disruptions and estimated recovery times.
But again, just because SREs should collaborate with these teams doesn’t mean they do. These non-technical teams are typically even more siloed from SREs than are developers and IT engineers.
So, that’s the problem. The real question is: How do you fix it?
Following are four approaches to increasing collaboration between SREs and other stakeholders in reliability engineering.
Your incident response playbooks probably focus first and foremost on the technical procedures that teams will follow to restore service.
But ideally, the playbooks will also cover other operations, like communications work by the PR team and contract assessment by the legal team, that are necessary to ensure a holistic response to incidents. When you build these processes into your playbooks, you make it easier to achieve close collaboration between SREs and other stakeholders.
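As a rough illustration of the idea above, here is a minimal sketch of a playbook represented as data, with a check that flags teams that have no step assigned. The `Step` and `Playbook` classes, the team names and the example steps are all hypothetical, chosen only for illustration; a real playbook would live in whatever incident management tooling you already use.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    owner: str   # team responsible for the step, e.g. "SRE", "PR", "Legal"
    action: str  # what that team does during the incident


@dataclass
class Playbook:
    name: str
    steps: list[Step] = field(default_factory=list)

    def owners(self) -> set[str]:
        """Teams that have at least one step in this playbook."""
        return {step.owner for step in self.steps}

    def missing_owners(self, required: set[str]) -> set[str]:
        """Teams that should take part but have no step in this playbook."""
        return required - self.owners()


# Hypothetical playbook for a customer-facing outage (illustrative only).
playbook = Playbook(
    name="customer-facing outage",
    steps=[
        Step("SRE", "Fail over to the standby region and confirm recovery"),
        Step("PR", "Publish a status update with the estimated recovery time"),
    ],
)

# Assumption: these are the teams your organization expects in every response.
REQUIRED_TEAMS = {"SRE", "PR", "Legal"}

print(sorted(playbook.missing_owners(REQUIRED_TEAMS)))
```

In this example the check reports that the legal team has no step yet, which is the kind of gap that is easier to close before an incident than during one.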
SREs often perform various kinds of tests, like failure mode and effects analysis (FMEA) assessments, to evaluate the reliability of systems they manage.
But these tests need not be the responsibility of SREs alone. Other stakeholders from across the IT organization and beyond can and should play a role in identifying reliability weak-points and assessing the impact of potential failures within the system.
When you include everyone in reliability testing, you build a stronger culture of shared responsibility.
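To make the FMEA example concrete, here is a minimal sketch of how a shared assessment could be scored. It assumes the common FMEA convention of rating severity, occurrence and detection on a 1 to 10 scale and multiplying them into a Risk Priority Number (RPN); your organization may use a different scale. The failure modes and scores below are invented for illustration, not real data.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    description: str
    severity: int    # 1 (negligible impact) to 10 (catastrophic impact)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (almost certain to be caught) to 10 (almost undetectable)

    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError("scores must be between 1 and 10")
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes; in practice, different stakeholders would
# contribute scores (for example, legal on the severity of an SLA breach).
failure_modes = [
    FailureMode("Primary database failover stalls", 8, 3, 4),
    FailureMode("TLS certificate expires unnoticed", 7, 2, 2),
    FailureMode("Partial outage triggers SLA credit clause", 6, 4, 7),
]

# Rank failure modes from highest to lowest risk priority.
ranked = sorted(failure_modes, key=lambda fm: fm.rpn(), reverse=True)
for fm in ranked:
    print(f"{fm.rpn():>4}  {fm.description}")
```

Here the contractual failure mode ranks first even though its severity score is the lowest, because it is both more likely and harder to detect. That is the sort of weak point a purely technical review could miss.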
Ideally, every time a developer writes a new line of code, an IT engineer modifies a production server or a lawyer changes the terms of a customer contract, reliability should be a consideration. But it’s often not, especially within organizations where reliability is seen as something that only SREs have to manage.
To change this, require all stakeholders to assess the consequences for reliability each time they make a change. When thinking about reliability becomes second nature for everyone, you end up with a healthier reliability culture and fewer barriers between SREs and the rest of the organization.
Finally, even as you work to make all stakeholders assume ownership of reliability, remember that your culture should nonetheless remain blameless. Shared responsibility for reliability engineering doesn’t mean that any one group needs to be blamed when something goes wrong.
Maintaining a blameless culture surrounding reliability is important for ensuring that stakeholders see reliability not as a burden imposed on them, but as an opportunity to collaborate with other teams and reinforce collective success.
SREs may specialize in reliability engineering, but ultimately, every stakeholder within the business plays a role in building and managing reliable systems. The key to getting the most out of reliability engineering is gaining buy-in from across the organization for collaborating and communicating with SREs, and breaking apart the silos that have conventionally isolated SREs from everyone else.