November 19, 2021
5 min read
A history of Site Reliability Engineering from its origins at Google in 2003 to the present.
If you know anything about the origins of Site Reliability Engineering, or SRE, you know that the concept was born at Google.
But why did Google establish the SRE role? And how did SRE spread from the search giant to companies of all types -- including but not limited to Web-scale businesses with massive reliability needs?
Keep reading for answers to these questions as we explore and analyze the history of SRE from Google to the present.
The first SRE team originated at Google in 2003 under the direction of Ben Treynor Sloss, who had begun his career as a software engineer at Oracle and several other companies before joining Google.
Neither Sloss nor Google in general has said much publicly about exactly why they created the SRE role. However, likely factors included:
The story of SRE’s gradual expansion from Google into businesses of all types unfolded in two main stages.
The first stage involved the adoption of SRE by other large, Web-scale companies similar to Google. Facebook had an SRE team by 2010, according to a blog post from the time. Netflix established a “core SRE team” by 2016. Uber started writing in the same year about how it uses SRE. LinkedIn was touting its “SRE culture” by 2017.
It’s easy enough to understand why large companies like these would import Google’s SRE concept into their own IT organizations. They faced the same challenges as Google: from early on, they had massive, distributed infrastructures to manage. They also needed to meet ever-steeper user expectations regarding performance and availability. Most of these companies embraced SRE after DevOps was already well established, probably because it was clear by the mid-2010s that DevOps alone doesn’t guarantee an excellent user experience.
The second, more interesting stage in the history of SRE is the adoption of SRE by “ordinary” companies -- meaning those without huge server farms to manage or billions of transactions to handle each day. Over the past few years, businesses of all types have begun hiring SREs, even if they don’t face special reliability challenges.
There are two possible explanations for why the SRE role has become a core part of IT organizations writ large. The cynical one is that SRE is just a trendy new name for what used to be called IT operations. In other words, companies that hire SREs today perhaps haven’t really changed how they operate; they’ve just adopted a fancier job title for their IT engineers.
But there’s a less cynical explanation for the widespread adoption of SRE, too. It boils down to the fact that we live in a world where users have extremely high expectations of websites and applications, and traditional IT operations strategy can’t accommodate them. Today, even if you operate a run-of-the-mill website or a mobile app with just a few thousand users, you need to make sure you can measure content load times in milliseconds and resolve availability issues in minutes instead of days if you want to keep up with your competition. The concepts, tools and strategies that SREs bring have helped smaller businesses achieve these goals.
That’s a brief summary of how SRE came to exist as we know it today. But where is it headed next?
It’s impossible to predict the future, of course. But if we had to take a guess, we’d say that SRE will become even more widespread at smaller companies. We also foresee ever-greater use of automation tools to streamline SRE workflows in ways that make it more practical for smaller companies to take advantage of SRE even if they lack large in-house IT teams.
If anything is certain, though, it’s that -- despite having originated as a relatively obscure concept within an elite company two decades ago -- SRE is not going anywhere. Even if the rate of creation of new SRE teams levels off, SRE is so well established at this point across companies of virtually all types and sizes that it’s hard to imagine a future where SRE is not a core part of IT strategies everywhere.