The Role of SREs in Observability

Quentin Rousseau

September 3, 2021

How do you achieve observability, meaning the ability to understand the internal state of a system based on its external outputs?

The most obvious answer to that question is to deploy observability tools, which can collect and correlate data from multiple sources to provide visibility into the internal state of a system.

But observability also requires the right people -- including, above all, SREs. Although the SRE role originated before most IT organizations were thinking about observability, the explosion of interest in observability in recent years is part of the reason why now is such an exciting time to be an SRE.

To prove the point, here’s an overview of the role that SREs play in observability.

A brief history of SREs and observability

To start out, it’s worth noting that, historically, SREs and observability did not go hand-in-hand.

Site Reliability Engineering, and the Site Reliability Engineer role, originated at Google in the early 2000s. Back then, almost no one was talking about observability within the context of IT. Observability has been an established concept in other fields of engineering, notably control theory, since the 1960s, but it wasn’t widely applied to IT work until the 2010s.

So, it would be wrong to argue that there is an intrinsic link between SREs and observability. The two concepts developed independently. Indeed, if you read most articles on observability today, you’ll find little mention of SREs. Likewise, important texts on SRE -- such as the Google SRE book -- barely mention observability.

SREs and the observability revolution

That doesn’t mean, however, that SREs have little role to play in observability today.

On the contrary, the observability revolution -- meaning the shift from monitoring to observability that many organizations have undertaken over the past several years -- makes SREs more important than ever.

That’s because SREs are uniquely positioned to help organizations achieve observability, for several reasons.

Observability requires expertise with disparate data sources and systems

One of the main challenges that organizations face is that observability hinges in part on a team’s ability to collect and correlate data from multiple sources. To gain full observability, you may need to figure out how an application performance issue that you identify in a log file relates to a code commit that you can trace through CI/CD data, for example.
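As a rough illustration of that kind of correlation, the sketch below matches the timestamp of a problem seen in a log to the commit that was live at that moment. It is a minimal sketch under stated assumptions: the deployment records, commit hashes, and the `commit_live_at` helper are all hypothetical, standing in for whatever your CI/CD system and log pipeline actually expose.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical deployment history (time, commit hash), sorted by time.
# In practice this would come from your CI/CD system's records.
deployments = [
    (datetime(2021, 9, 1, 10, 0), "a1b2c3d"),
    (datetime(2021, 9, 1, 14, 30), "e4f5a6b"),
    (datetime(2021, 9, 2, 9, 15), "c7d8e9f"),
]


def commit_live_at(timestamp, deployments):
    """Return the commit deployed at or most recently before `timestamp`.

    Returns None if the timestamp predates every known deployment.
    Assumes `deployments` is sorted by deployment time.
    """
    times = [deployed_at for deployed_at, _ in deployments]
    index = bisect_right(times, timestamp)
    return deployments[index - 1][1] if index > 0 else None


# Hypothetical timestamp of a slow-request entry found in a log file.
log_event_time = datetime(2021, 9, 1, 15, 2)
suspect_commit = commit_live_at(log_event_time, deployments)
print(f"Commit live when the issue was logged: {suspect_commit}")
```

Real correlation is rarely this tidy, since clocks drift and rollouts are gradual, but the idea is the same: line up data from two systems on a shared key, in this case time.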

Not all engineers possess the broad set of skills required to work with multiple types of systems. Developers typically know development tools best, while IT engineers tend to be most familiar with production environments. But SREs, almost by definition, bring full-stack expertise to the table. Their job is to manage reliability across all facets of an organization’s IT estate, and they know how to collect and interpret data from all components of a system.

For teams struggling with the challenge of observing complex, sprawling IT environments, then, SREs can play a unique role in integrating everything together.

Observability and reliability go hand-in-hand

SREs should also play a central role in observability because it’s only by observing systems that they can determine whether reliability goals are being met.

The main aim of an SRE, of course, is to prevent reliability or performance issues from occurring in the first place. But when something does go wrong, the team learns about it through observability.

Thus, by participating in the observability process, SREs can more accurately determine what they are doing well when it comes to reliability engineering, and where they need to improve. In this sense, SREs can help drive a continuous improvement loop that unites observability with reliability engineering.
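To make that loop concrete, one common way to express a reliability goal is as a service level objective (SLO) with an associated error budget. The article itself doesn’t prescribe this approach, so treat the following as a minimal sketch: the function name, the 99.9% target, and the request counts are all assumptions chosen for illustration, and the counts stand in for figures an observability tool would report.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent.

    1.0 means no failures have been observed, 0.0 means the budget is
    exactly used up, and a negative value means the SLO has been missed.
    `slo_target` is the desired success ratio, e.g. 0.999 for 99.9%.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # No traffic observed, so no budget has been spent.
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# Hypothetical figures for one measurement window.
remaining = error_budget_remaining(
    slo_target=0.999,
    total_requests=1_000_000,
    failed_requests=250,
)
print(f"Error budget remaining: {remaining:.1%}")
```

A shrinking budget is a signal, drawn directly from observed data, that reliability work should take priority over new changes.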

SREs excel at incident response

In addition to reliability engineering, SREs possess a unique level of expertise with regard to incident management and incident response.

That matters in the context of observability because incident management and response come into play when observability reveals a problem. By taking the lead in managing incident remediation, SREs help ensure that a team’s investments in observability allow it not just to find problems, but also to solve them quickly and efficiently.

Observability beyond SREs

To be clear, I’m not saying that SREs alone should drive an organization's observability strategy. Observability requires participation by a variety of stakeholders, including developers, IT Ops engineers, DevOps engineers and even security experts.

But because the conversation around observability has, to date, largely ignored SREs, it’s important to drive home the role that they can and should play. It’s hard to imagine a viable observability strategy today that doesn’t include SREs.