February 11, 2022
4 min read
A list of the top nine SRE skills, from incident management, to cloud computing, to networking and beyond.
It’s easy to talk at a high level about what Site Reliability Engineers do: They ensure that IT systems achieve availability and performance requirements.
But which skills, exactly, do SREs need to do their jobs? That’s a more complicated question.
To answer it, this article walks through the top nine skills that modern SREs (or aspiring SREs) should master. The exact mix may vary from one team to the next, depending on the types of systems each team manages and the main reliability challenges it faces. Even so, virtually all SREs need a core set of standard skills that allow them to understand and manage the complex, distributed systems that a typical organization runs today.
Without further ado, here’s a breakdown of top SRE skills.
The network plays a pivotal role in connecting modern, distributed environments. As such, it’s often the culprit when something goes wrong -- a lesson that Facebook, for example, learned in October 2021, when a faulty network configuration change took its services offline worldwide.
Situations like this are why SREs should master networking concepts. Even if their organization also employs networking engineers, SREs need a deep understanding of networking themselves to know when the network is the root cause of an incident and how to resolve network-caused issues effectively.
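As a rough illustration of telling network-level causes apart, the sketch below separates a DNS resolution failure from a TCP connection failure. The function name `classify_connectivity` and its return labels are hypothetical, chosen only for this example; the `resolve` and `connect` parameters exist so the logic can be exercised without touching a real network.

```python
import socket
from typing import Callable


def classify_connectivity(
    host: str,
    port: int,
    timeout: float = 3.0,
    resolve: Callable = socket.getaddrinfo,
    connect: Callable = socket.create_connection,
) -> str:
    """Return 'dns_failure', 'tcp_failure', or 'ok' for host:port.

    The labels are illustrative assumptions, not a standard taxonomy.
    """
    # Step 1: can the name be resolved at all?
    try:
        resolve(host, port)
    except socket.gaierror:
        return "dns_failure"

    # Step 2: can a TCP connection be opened within the timeout?
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:  # covers refused connections, timeouts, unreachable hosts
        return "tcp_failure"

    conn.close()
    return "ok"
```

Calling `classify_connectivity("example.com", 443)` from an affected host gives a quick first hint about whether to look at name resolution or at routing and firewalls. A real diagnosis would go further (packet loss, latency, TLS), so treat this as a starting point only.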
Like networking, cloud computing is a category of skill that modern SREs can’t live without.
The reason is almost self-explanatory: Around 90 percent of businesses use the cloud, and you can’t manage reliability for cloud environments very well if you don’t understand cloud architectures, cloud networking, cloud data storage, cloud observability and related topics.
SREs don’t typically help to develop software, but they nonetheless need a deep understanding of how software is written and deployed -- which, at most organizations today, is a process that happens via a CI/CD pipeline.
It’s hard to engineer reliability if you don’t know how to address reliability problems that emerge from application source code or deployment processes. Understanding how CI/CD processes work and which tools drive them is key for virtually every SRE today.
If you come from a Windows background but you want to be an SRE, there’s no getting around it: You’ll need to learn how to work with Linux and other Unix-like systems in addition to Windows.
That’s because, even at organizations that don’t rely heavily on Linux servers, you’re likely to find that Linux and Unix concepts are deeply embedded within other systems that you have to work with. Most public cloud management tools follow the conventions of Linux CLI tools, for example. So do systems like Docker and Kubernetes, even if you run them in a Windows environment.
SREs also don’t usually help to test software pre-deployment. That task falls to developers or quality assurance engineers.
Nonetheless, understanding how software is tested -- and how to use test automation to speed up tests and expand test coverage -- is a vital SRE skill. After all, the more thoroughly and efficiently your team can test software, the greater your chances of catching reliability problems pre-deployment, when they are easier to fix and pose a much lower risk to the business.
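To make the idea of catching reliability problems before deployment concrete, here is a minimal sketch of an automated test that simulates a failing dependency. The helper `fetch_with_fallback` is hypothetical and exists only for this example; the tests use Python’s standard `unittest` module.

```python
import unittest


def fetch_with_fallback(fetch, fallback):
    """Call fetch(); if it raises, return the fallback value instead.

    Hypothetical helper used only to illustrate the test below.
    """
    try:
        return fetch()
    except Exception:
        return fallback


class FetchWithFallbackTest(unittest.TestCase):
    def test_returns_live_value_when_dependency_is_healthy(self):
        self.assertEqual(fetch_with_fallback(lambda: "live", "cached"), "live")

    def test_returns_fallback_when_dependency_fails(self):
        def broken():
            # Simulate the dependency timing out.
            raise TimeoutError("dependency timed out")

        self.assertEqual(fetch_with_fallback(broken, "cached"), "cached")
```

Saved to a file, these tests can be run with `python -m unittest`. Wiring a suite like this into the delivery pipeline means a change that breaks the fallback path fails the build rather than failing in production.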
Security is another domain that SREs don’t “own,” but where they nonetheless require significant skills.
Indeed, good reliability engineering makes security a priority, and vice versa. SREs who don’t understand security fundamentals are at risk of implementing reliability solutions that are effective from a reliability standpoint, but not necessarily secure.
Although SREs are not DevOps engineers, SRE and DevOps are closely related domains. SREs at most organizations today will be expected to understand DevOps concepts and, in many cases, work alongside DevOps teams.
So, plan to master DevOps skills as part of your SRE skills acquisition strategy.
Perhaps the single most important type of skill for SREs to learn is incident management. Although many roles may participate in incident response, SREs usually take the lead in organizing the incident response team, communicating with stakeholders and devising the best strategy for resolving each incident as quickly as possible.
This means SREs should know how incident response roles are structured and understand incident response concepts. They should also be familiar with incident response platforms that automate the complex processes required to ensure rapid, effective incident resolution.
In addition to overseeing incident response, SREs may be tasked with managing postmortems. Knowing how to run a postmortem -- as well as when a postmortem is necessary, and when it makes sense to use a “blameless” postmortem approach -- is an essential SRE skill.
The list of SRE skills could certainly go on. Above are only the most fundamental types of skills SREs will need for most modern environments. But if you’re just starting out on your journey to becoming an SRE, the nine skill domains described above are a good place to begin acquiring the knowledge you’ll need to excel in an SRE career.