Mastering SEV0: Best Practices for Handling High-Severity Incidents
Handling SEV0 incidents requires careful but expeditious action. Learn how top-performing teams deal with them at scale.
December 18, 2024
5 mins
From human alerting chains to underpowered web servers, it was a far cry from today’s automation-driven incident management. Discover how far we’ve come in the evolution of monitoring and why delegating tasks to today’s tools can save you from burnout.
Stefan is currently a Principal SRE at Achievers. With 20 years of experience writing code and designing systems, he has always emphasized monitoring and automation at scale. When not keeping systems running smoothly, he can be found playing guitar with his son.
2008 was a landmark year: a major economic downturn, Obama’s historic election, the Beijing Olympics—and let’s not forget how Flo Rida kept us dancing through it all.
It was also the fifth anniversary of Google’s SRE definition, though those ideas were still abstract to many in the industry, if they were known at all. PagerDuty didn’t yet exist.
But incidents certainly did. In my case, they happened a lot, caused by either software or hardware issues. I managed over 30,000 servers in data centers across the globe. Back then, there was essentially no dedicated software to help us manage any part of the incident lifecycle. Alerting, assembling teams, managing incident communications, and conducting postmortems were all painfully manual tasks.
As we close out the first quarter of this century, I want to share what it was like managing incidents before today’s robust tooling was even imagined.
Back in 2008, I worked in a Network Operations Center (NOC), spending 12-hour shifts glaring at a maze of little black terminal windows and a wall of blue screens. The mission? Watch for red. When red appeared, fix it.
Sometimes there’d be a flood of red, and if my team couldn’t handle it, we became the human alerting system—escalating to someone, who would call someone else, who was probably at the pub with the one person who actually knew what was wrong. They’d most likely stumble into a cab together, head to the data center, and turn something off and back on again.
During these chaotic moments, you’d have to pause and update customers and employees. Try juggling a call, furiously typing commands, reviewing logs, and writing updates about an ongoing meltdown. The tool of choice for those updates? A hand-me-down web server, often holding on for dear life itself and quickly being starved of its minimum vital resources.
The pressure only grew when a manager popped online demanding a timeline and updates.
It wasn’t just my manager asking for updates. At any given moment, three or four other people could call, message, or knock on the NOC door looking for information. Without a consistent feedback loop—like a live status page—there was no simple way to reassure stakeholders or customers that the incident was being actively addressed.
Eventually, the dream team would call from the data center (with beer, because it was Friday night). After a quick firewall power cycle, everything would be resolved, and then the phone would ring again. It was the Incident Manager checking in, curious about the next update and whether we had a root cause analysis yet. No pressure, right?
Later that night, standing outside with my emergency coffee, I’d look up at our third-floor command center. Those big blue monitoring screens lit up the night like a scene from a tech-noir film. I’d wonder about the future of incident management—and seriously, why no one had ever invested in blinds.
Compared to those times, the two most significant takeaways for me are the evolution of monitoring and the way we now rely on tools to manage an incident’s lifecycle.
The monitoring landscape has evolved dramatically since my blue screen wall days. Companies have moved from “follow-the-sun” support to smart alerting systems that page and escalate automatically. Monitoring systems themselves are now highly sophisticated, but this creates a new challenge: ensuring that the monitoring itself stays reliable.
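To make "page and escalate automatically" concrete, here is a minimal sketch of an escalation policy in Python. It is not the API of any real paging product: the `EscalationStep` type, the `who_to_page` function, and the contact names and wait times are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    contact: str       # who gets paged at this step (hypothetical names)
    wait_minutes: int  # how long to wait for an acknowledgement before escalating


def who_to_page(policy: list[EscalationStep], minutes_unacknowledged: int) -> list[str]:
    """Return everyone who should have been paged so far for an unacknowledged alert."""
    paged = []
    elapsed = 0  # minutes after the alert at which the current step starts
    for step in policy:
        if minutes_unacknowledged < elapsed:
            break  # this step has not been reached yet
        paged.append(step.contact)
        elapsed += step.wait_minutes
    return paged


# Assumed example policy; real teams would tune the contacts and timings.
POLICY = [
    EscalationStep("primary-on-call", 5),
    EscalationStep("secondary-on-call", 10),
    EscalationStep("engineering-manager", 15),
]

if __name__ == "__main__":
    for minutes in (0, 5, 20):
        print(minutes, who_to_page(POLICY, minutes))
```

The policy is plain data, so the chain of phone calls that once depended on who happened to be reachable becomes something a team can review and change ahead of time.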
Managing your own in-house monitoring solution can be a liability. My recommendation? Let today’s observability platforms do what they do best. Removing the stress of ensuring the reliability and scalability of your monitoring platform is a worthy investment.
Juggling too many responsibilities during an outage often leads to gaps and mistakes in the post-incident review, because the timeline has to be pieced together from memory afterward. Today, we have tools that document events in real time, reducing human error.
Solutions like Rootly handle timeline summaries automatically, cutting out manual detective work so teams can focus on fixing issues.
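As a rough illustration of what recording events as they happen means, the following Python sketch keeps a timestamped timeline and renders it in chronological order. It is an assumption-laden toy, not Rootly's implementation or API: the `IncidentTimeline` class, its methods, and the sample events are all invented for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IncidentTimeline:
    title: str
    # Each event is (timestamp, actor, note).
    events: list = field(default_factory=list)

    def record(self, actor: str, note: str, at: Optional[datetime] = None) -> None:
        """Append an event, stamped with the current UTC time unless one is given."""
        self.events.append((at or datetime.now(timezone.utc), actor, note))

    def summary(self) -> str:
        """Render the events in chronological order, one per line."""
        lines = [f"Timeline: {self.title}"]
        for at, actor, note in sorted(self.events, key=lambda event: event[0]):
            lines.append(f"{at:%H:%M} UTC | {actor}: {note}")
        return "\n".join(lines)


if __name__ == "__main__":
    timeline = IncidentTimeline("Firewall outage")  # hypothetical incident
    timeline.record("monitoring", "Alert fired")
    timeline.record("noc", "Firewall power-cycled")
    print(timeline.summary())
```

Because each entry is stamped when it is recorded, the post-incident review starts from a log rather than from recollection.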
In today’s world of automation, alerting, and AI-generated summaries, fewer eyeballs need to stare at blue screens 24/7. Status pages now connect directly to observability tools, giving customers real-time updates instead of waiting for an under-caffeinated NOC engineer to compile reports. Automation minimizes human error—and stress.
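One way a status page could be driven by monitoring data is to collapse the currently firing alerts into a single public status per component. The Python sketch below assumes a made-up alert shape (dictionaries with `component` and `severity` keys) and made-up status labels. Real observability and status page products define their own formats.

```python
# Assumed mapping from alert severity to a customer-facing status label.
SEVERITY_TO_STATUS = {
    "critical": "major outage",
    "warning": "degraded performance",
}

# Statuses ordered from best to worst, so that the worst one wins.
STATUS_RANK = ["operational", "degraded performance", "major outage"]


def public_status(firing_alerts: list) -> dict:
    """Collapse firing alerts into one status per component, keeping the worst."""
    statuses = {}
    for alert in firing_alerts:
        component = alert["component"]
        # Unknown severities (for example "info") do not affect the public status.
        status = SEVERITY_TO_STATUS.get(alert["severity"], "operational")
        current = statuses.get(component, "operational")
        if STATUS_RANK.index(status) > STATUS_RANK.index(current):
            statuses[component] = status
        else:
            statuses.setdefault(component, current)
    return statuses


if __name__ == "__main__":
    alerts = [
        {"component": "api", "severity": "warning"},
        {"component": "api", "severity": "critical"},
        {"component": "web", "severity": "info"},
    ]
    print(public_status(alerts))
```

Because the status is derived from alerts rather than typed by hand, customers see a change as soon as monitoring does.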