


Brandon Chalk has spent over a decade in Site Reliability and Security Incident Response, leading teams at Google and Databricks. His experience spans from managing large-scale production systems to building cross-functional programs that strengthen operational resilience.
When I triage a page, my first goal is to figure out whether I can resolve the issue on my own or if I need help. That help might range from a quick “Hey, how does this work?” to an all-hands-on-deck situation. Getting that help on time can be the difference between a quick resolution and a painful, hours-long process.
But I’ve often fallen into the trap of waiting too long before reaching out. Earlier in my career, it was usually because I wanted to be the one to fix it: a bit of a hero complex. Other times, I’d go down a rabbit hole so deep that I’d lose track of time. That’s why I started using a shot clock for myself when triaging incidents. Once the buzzer goes off, it’s time to call for help (I’ll cover how I decide how long to set the shot clock for in a bit).
Acknowledging and resolving a page doesn’t mean you have to handle it alone. Sometimes, you need to rely on others to get across the finish line. In this article, I’ll share how I think about asking for help during an incident.
Every time my pager goes off, the first thing that crosses my mind is: how can I resolve this as quickly and effectively as possible? My goal is to avoid a prolonged interruption, get things back on track, and resume what I was working on.
Like pilots reciting a pre-flight checklist, it helps to go through a fixed set of questions, even if some might not seem immediately relevant. It can feel a bit heavy-handed, but the benefit is clear: it reduces cognitive load during incident response because you can simply pull up a list and start working through it.
These are the questions I’ve developed throughout my SRE career at Google and Databricks:
My teams have always tried to document common questions to help us triage effectively when a page comes in. I recommend doing the same: track the ones that work for you and build your own checklist over time.
Once I’ve worked through my initial checklist, the next question is timing: how long should I try to fix this before asking for help? That’s where the triage shot clock comes in.
When I begin the initial triage, I’m always gauging whether this is something I can handle alone or if I need to bring in more support. I timebox this decision by setting a triage shot clock. It acts as a forcing mechanism to ask for help rather than trying to be a hero and digging myself into a deeper hole.
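As a rough illustration, a shot clock is just a start time plus a time budget. The sketch below is one way to model that in Python. The `ShotClock` name, its fields, and the 15-minute budget are assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ShotClock:
    """Hypothetical timebox for solo triage; once it expires, ask for help."""

    started_at: datetime
    budget: timedelta

    @property
    def deadline(self) -> datetime:
        return self.started_at + self.budget

    def remaining(self, now: datetime) -> timedelta:
        # Never report negative time; zero means the buzzer has gone off.
        return max(self.deadline - now, timedelta(0))

    def expired(self, now: datetime) -> bool:
        return now >= self.deadline


# Arbitrary example values: triage starts at 03:00 with a 15-minute budget.
start = datetime(2024, 1, 1, 3, 0)
clock = ShotClock(started_at=start, budget=timedelta(minutes=15))

print(clock.expired(start + timedelta(minutes=10)))  # False
print(clock.expired(start + timedelta(minutes=15)))  # True
```

In practice a phone timer does the same job. What matters is that the deadline is fixed before you start digging, so the decision to escalate is not left to how you feel mid-investigation.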
You might wonder: what’s the right length for the shot clock? It can range from a few minutes to a few hours, depending on your role and environment. I usually set it after completing the initial triage, once I’ve determined the criticality of the alert and have a sense of how much runway I have before needing to bring others in.
For example:
Determining these ranges comes with operational experience in your organization. When to engage external stakeholders can even be an interesting metric to track during incident reviews and postmortems, helping you refine your time limits based on the priority and severity of each response.
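If you do track this in postmortems, one possible way to turn it into a starting point is to group past incidents by severity and look at how long responders typically worked before escalating. The sketch below assumes each record is a `(severity, minutes_to_escalation)` pair. The function name, the severity labels, and the sample numbers are all made up for illustration, and the median is only one reasonable summary to start from.

```python
from collections import defaultdict
from statistics import median


def suggest_budgets(records):
    """Return the median minutes-to-escalation for each severity level."""
    by_severity = defaultdict(list)
    for severity, minutes in records:
        by_severity[severity].append(minutes)
    return {sev: median(vals) for sev, vals in by_severity.items()}


# Made-up sample data, not taken from real incidents.
history = [
    ("sev1", 5),
    ("sev1", 10),
    ("sev1", 30),
    ("sev2", 20),
    ("sev2", 40),
]

print(suggest_budgets(history))
```

Treat the output as a conversation starter for the review rather than a rule: if escalations at a given severity routinely came too late, the shot clock should probably be shorter than the historical median.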
I’ve often seen responders fall into the trap of trying to be a hero: doing everything on their own instead of asking for help. While the intention is good, waiting too long to loop others in can have a cascading effect. By setting a triage shot clock, you give yourself a clear time limit for when to ask for help, which can significantly reduce an incident’s blast radius.
Earlier, I mentioned that the first thing I ask myself when I get a page is: how can I resolve this as quickly and effectively as possible and get back to what I was doing? What I don’t tell myself is: this is my problem to solve alone. When I set my shot clock, I give myself X minutes to work on the issue before reaching out if I’m not making progress.
Your main objective is to resolve the issue effectively, which often means bringing in others. Think of your role as orchestrating a successful resolution rather than going at it alone. Yes, you’re the first to respond because of the nature of on-call, but you’re still part of a team. Asking for help isn’t a sign of weakness; it shows maturity as a responder. You’re prioritizing system integrity over ego. It also gives partner teams visibility into parts of their software they might not otherwise see if you’re always fixing things in isolation.
It’s easy to fall into the hero mentality. I know I have. But a seasoned responder knows better—they understand that their primary objective is to solve the issue as quickly and effectively as possible, and that often means bringing in help. My goal is to maintain the SLAs and SLOs our team has agreed upon, and if that means asking for help, I’m going to do it.
You might be wondering how this approach plays into performance reviews or how to shift your team’s culture to be more open to outside help. Google’s Alexander Malmberg explores this in detail in “Why Heroism is Bad,” which goes much deeper than I can here. While his perspective is geared toward large organizations, one key takeaway applies everywhere: constant heroism hides systemic issues. Being a hero can work at times—and sometimes it’s necessary—but it shouldn’t become routine.
This mindset clearly benefits larger teams with more headcount, but it’s just as valuable in lean organizations. Asking for help doesn’t mean pulling in a ten-person engineering team; sometimes it’s simply finding the one person who can help. In a healthy, collaborative environment, everyone should be open to pitching in where they can.
A triage shot clock gives you guardrails for more reliable operations. It’s a simple but powerful tool—a hard limit on how long to work alone before bringing others in. Asking for help isn’t a sign of weakness; it’s a sign of maturity, because you’re prioritizing the system’s integrity over your own ego.
Remember: the true mark of success isn’t that you fixed it, but that it got fixed.