

Rootly joins Groq OpenBench with an SRE-focused benchmark
Making LLM evaluations reproducible for real-world SRE workflows
September 9, 2025
5 mins
“Art, in itself, is an attempt to bring order out of chaos.” - Stephen Sondheim
“Take a breath, this isn’t life or death,” my mentor pinged me on Slack during my first SEV1, having noticed the panic in my voice and the sweat on my forehead as I attempted to recall the TSGs, talking points, and engagement models I was shown in training.
And they were right, mostly.
What we do isn’t life or death (outside of hero-SREs who manage incidents for Life and Safety orgs), though nobody told my body’s physiological threat response that. I knew that if I simply followed the standard process, everything would work out.
Why was I sweating if this didn’t matter?
This is, of course, a reductive way to talk about something extremely important and high stakes for any business. Your company may have calculated the cost of service disruptions, and that figure likely shows a staggering financial impact for every minute of downtime.
And skilled Incident Managers are required to see complex issues through to mitigation, navigating an unrelenting barrage of unexpected blockers, stakeholder demands, change approvals, communication deliverables, and process red tape. Incident Managers are drivers of action, chaos whisperers, using any means to stop the bleeding, get out of impact, and restore service.
We want to think we’re like Jack Bauer, taking action no matter the cost to get results, but in reality, we’re more akin to the director of a musical – collecting information from a variety of sources, processing it, and harmonizing the inputs to produce a collaboration aimed at a unified goal.
"Artists who seek perfection in everything are those who can't attain it in anything" – Eugene Delacroix
Let’s illustrate this concept with a familiar setting. The alert and page come through. You click the Slack link to the incident. Your CMO just shared the registration details for the big conference in September and the registration site crashed. It’s go time.
Your training prepared you for this. You’ve got the playbook queued up, sip your coffee, and launch the bridge. The Swarm begins, and like a carefully-choreographed dance, response teams report to the bridge. Comms have gone out, teams are investigating. Everything is going exactly to script. And then it happens.
More alerts fire, leadership joins the bridge, and everyone’s voices start getting faster and louder. The event page was just a symptom. A vendor outage is causing cascading failures across your infrastructure and multiple applications are down.
Suddenly, what was once a scripted process devolves into a series of questions and unknowns. How do we communicate about the expanded scope of impact after we already communicated a registration issue? Which internal teams need to be notified? This impact is too murky to properly assess the severity - how do we make a call? Do we have the right teams on the bridge, and who do we call on first? Your scripted process doesn’t account for this scenario. Then, the CTO interrupts the investigation to ask why the incident isn’t a SEV0. Deep breaths, dim the house lights, and raise the curtain.
Your training will guide you, but it only gets you halfway to a solution. Too often, SRE teams are tethered to documented processes. Embrace the artistic process!
Why can this be so challenging? You have reported impact, you have predefined impact scenarios each mapping to specific severities, you have a breadth of experience seeing incidents play out - this should be the easy part. But if it were easy, we’d be able to see every issue coming, build alerting for it, and stop it before it happened. It’s an incident because something about it wasn’t predictable. Every incident is going to look and feel different, even the repeat ones. Trying to fit a square peg into a round hole isn’t going to work, and this is where predefined severity processes fall short.
If it were possible to script this, automation could handle it. An Incident Manager’s role is to synthesize all the relevant data points, make an assessment (see: judgement call) of the user impact, clearly articulate a concrete problem statement of that impact to stakeholders, and assign the proper severity in accordance with guidelines. Since every scenario is different, the innate human ability to understand the gravity of certain words is key. This goes beyond words - it’s listening for tone, recalling historical events, and understanding the context of what’s happening around the enterprise.
At the end of the day, the skill of filtering out the noise and making a firm, decisive call on how to characterize an incident is a bit like performing music - anybody can learn how to read music, but an artist elevates what’s on the page via interpretation.
In Part II, we’ll dive deeper into the art of Time Management, Incident Communications, and Bridge Command. To be continued…
Get more features at half the cost of legacy tools.