Most incident response advice focuses on tools, alerts, and post-mortems. Gandhi Mathi Nathan Kumar — Principal Incident Commander at Twilio with 14 years running calls that have pulled in up to 100 responders — argues the work that actually matters happens in the first 15 minutes.
In this episode, Gandhi walks through what he calls the golden hour: the window where you decide whether you know what's broken, who belongs on the call, and whether to chase the root cause or reach for redundancy. He gets into why mitigation has to come before diagnosis, why customers trust your status page more than your engineers, and why he once sat with a stopwatch counting how many clicks it took to declare an incident. Along the way: the human side leaders keep underinvesting in, the math of on-call fatigue, and where AI is actually pulling weight in the incident commander seat.
How did you get into incident management?
Close to 14 years now. Fresh out of college in 2009, I joined Tata Communications, a telecom carrier — think AT&T or Verizon — on the L1 help desk team. That was my first time talking to customers, watching hundreds of tickets pour in during outages, and feeling how hot that seat gets in the middle of an incident. I wasn't troubleshooting back then, just providing updates. But that was my real introduction to incident management from a support point of view.
Over the next few years, I moved to L2, then L3 NOC engineering, where you do everything — troubleshoot, communicate, fix. Looking back, I was doing incident management the whole time without knowing that's what it was called. The role didn't really exist as a named position until Google's SRE practice popularized it. My first proper incident commander role came in 2015, right after I finished my master's in telecommunications management. That was the first time I had 100% authority on a call and was responsible for getting the impacted product back to normal.
How has the industry changed across your career?
Up until 2020, things felt fairly flat. Then COVID hit, digitization accelerated, and right after that ChatGPT showed up — and now you can think of any problem and within a week there's a startup solving it with AI. The space has been growing exponentially, especially for incident management roles.
The biggest unlock for me has been AI tools that take cognitive load off the commander. There are a hundred things happening at once during an incident — documentation, summarization, deciding which alerts to look at, parsing what's being said on the call. The minute I stop talking and start documenting or searching the wiki, I'm losing control of the call. That can translate directly to minutes of mitigation delayed. Some Sev0s have 80 to 100 people on them. Keeping all of them aligned is the commander's primary job, and AI tools are what let me actually do that.
What is the role of an incident commander?
A bit of history first: the role goes back to the US government forming FEMA and the National Incident Management System. The original incident commanders were firefighters and disaster relief leads. Take that idea and apply it to technology, where things break all the time and you need a central point of contact running the show.
You can have several experts on the call — people responsible for fixing, people handling customer communications, people managing executive escalations, people engaging with vendors. The commander is the pivot. Some companies call it incident manager, some call it incident commander, but the function is the same: this person is your single point of contact until the incident resolves, and even beyond. If you have a question on the incident, you go to the commander, and they tell you where to go from there.
If a team is building incident management from scratch, where do they start?
Two pieces — and each has subcategories. The first is the human aspect, which a lot of teams really overlook. At the end of the day, this is a fight-or-flight situation. At 9 a.m. you're enjoying your coffee. By 9:05 every product in the company can be down. The first response from anyone — even people far more experienced than me — is biological. Your body senses danger and reacts. So step one is building a culture where people understand it's okay when things break, and that the team will collectively get things back to normal.
The second piece is the tools and processes — dashboards, alerting, automation, workflows. The phrase I keep coming back to: you want all your tools during an incident to work for you. You don't want to be working them. The tools should be supporting you, almost predicting what you'll need next.
These two halves go hand in hand. The culture, the team, the relationships between engineering, leadership, and support — and then the tooling that wraps around it.
What makes communication during incidents so hard?
You have a small core working on the fix, and a much larger group that's customer-facing — TAMs, support, executives — who are equally important and need progress updates. Some customers want the root cause before the issue is even fixed. You can't say no to them, because they have their own customers asking. So everyone is downstream of you and they all want different levels of detail.
The challenge is communicating enough to show progress, while keeping the engineering core undisturbed. That's where the commander earns their keep — taking the technical detail and reshaping it for the audience. AI is helping a lot here. Almost every vendor is building something that transcribes the incident in real time and summarizes it differently depending on who needs the update.
You once sat with a stopwatch counting clicks. What were you doing?
This is one of my favorite analogies. When most teams build out incident response, they assume more is better — capture everything up front, fill all the fields, declare the impact, declare the customer count. But the more steps you put between an engineer and an open incident, the more you're impeding mitigation. There's a sweet spot.
I was evaluating different vendors, and I literally sat down with a stopwatch. Someone pings me — XYZ product is down, declare an incident. How many clicks does it take? How many fields are required? How easy is it to find the right buttons? One of the asks was: can we declare an incident in under 60 seconds? That exercise showed me how much early stress you can remove from responders just by stripping required fields back to the ones you actually need.
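To make the "under 60 seconds" target concrete, here is a minimal sketch of what a stripped-down declaration call could look like: one request, two required fields, everything else deferred until mitigation is underway. The endpoint, field names, and client below are hypothetical, not Twilio's tooling or any particular vendor's API.

```python
import requests

# Hypothetical declaration endpoint; swap in whatever your tooling exposes.
DECLARE_URL = "https://incidents.example.com/api/v1/incidents"

def declare_incident(title: str, severity: str, api_token: str) -> str:
    """Open an incident with only the fields a responder knows in the first minute.

    Impact estimates, customer counts, and root-cause hypotheses get filled in
    later; the point is to get the incident open and get back to the call.
    """
    resp = requests.post(
        DECLARE_URL,
        headers={"Authorization": f"Bearer {api_token}"},
        json={"title": title, "severity": severity},  # two required fields, nothing more
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["id"]

# "XYZ product is down, declare an incident" in one call:
# incident_id = declare_incident("XYZ API returning 500s", "sev1", token)
```

Every extra required field at declaration time is another few seconds between a stressed responder and an open incident, which is exactly what the stopwatch exercise was measuring.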
Where is the industry on responder health?
Stress is impossible to fully remove. Even companies with global redundancy have outages. The biggest factors in handling it are rapport and reps.
Rapport: you want commanders, responders, and engineers to know each other well enough to have cohesive synergy on a call. A lot of those relationships actually get built after the incident — in the post-incident retrospective, when guards are down and people can talk openly about what happened. I've personally benefited from that. Next time I'm on call with those same people, it's much easier to work together.
Reps: game days, tabletop exercises, mock incidents. The body relies on muscle memory in fight-or-flight. My analogy is the gym — you don't show up one day and try to lift 150 pounds. You build to it. Same thing here. The more your team rehearses, the less the real incident overwhelms them.
When does an organization need a dedicated incident response team?
Engineers can do both feature work and incident response — many companies start there. But you eventually hit a ceiling where tech debt has compounded so much that you have to choose between shipping new features and paying down reliability work. That's the telltale sign. When the engineering team can't do both, you need a dedicated reliability or incident management team.
A common model is follow-the-sun: one team in the US, one in Europe, one in Asia, eight hours each, year-round. It's much more sustainable. Performance and efficiency drop sharply when someone's on call for more than six or seven hours. These are humans — they need to eat, sleep, take a break. You have to design for that.
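For a rough picture of how the eight-hour handoffs line up, here is a small sketch that maps the current UTC hour to which regional team holds the pager. The shift boundaries are illustrative only; a real rotation would add handoff overlap, holidays, and each team's local working hours.

```python
from datetime import datetime, timezone
from typing import Optional

# Illustrative follow-the-sun boundaries in UTC: three regions, eight hours each.
SHIFTS = [
    (0, 8, "asia-pacific"),
    (8, 16, "europe"),
    (16, 24, "us"),
]

def on_call_region(now: Optional[datetime] = None) -> str:
    """Return which regional team is on call for the current UTC hour."""
    hour = (now or datetime.now(timezone.utc)).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    raise ValueError(f"hour {hour} is not covered by any shift")

# print(on_call_region())  # e.g. "europe" at 10:00 UTC
```

The point of the model isn't the scheduling code; it's that nobody carries the pager past the six-to-seven-hour mark where performance starts to slide.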
How do you balance speed and accuracy in customer communications?
I call the first 10 to 15 minutes of an incident the golden hour. So much is happening in that window, and it pretty much decides the rest of the incident. Do you know what's broken? Do you know the customer impact and the customer experience, which are two different things? Can you get a high-level update on the status page quickly? If you don't get something out in those 15 minutes, you've already lost trust.
That window also decides the course of action. Your communication is only as good as the direction you're investigating. If I keep updating customers about something I think I'm doing, and an hour later I have to say "actually, we're going a different way" — I've eroded trust further. The trap most teams fall into is asking "What broke this?" That's not the question for the first 15 minutes. The question is: "How do we mitigate?" Customers care about getting back up, not the root cause. The root cause matters in the RFO, later.
The other thing customers hate is repeat status page posts that say nothing new. "Next update in 30 minutes." Then 30 minutes later: still investigating. That tells me as an outsider that you don't know what's going on. Updates need to be timely and progressively better.
Where is AI actually making a difference in incident response?
The most useful application I'm seeing is contextual summarization. We're exploring integrating a model with our internal knowledge base and our incident management tools, and teaching it to translate the same situation for different audiences. Ask one command, and it summarizes for a customer. Ask another, and it summarizes internally with technical detail.
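One way to picture the "same incident, different audiences" idea is a set of prompt templates keyed by who needs the update, with the live transcript dropped in. The templates below are illustrative only; the actual model call and knowledge-base integration would be whatever the team has wired up.

```python
# Illustrative audience-specific prompt templates; not a specific vendor's integration.
AUDIENCE_PROMPTS = {
    "customer": (
        "Summarize this incident for an external customer. Focus on user-facing "
        "impact and current mitigation status. Avoid internal system names.\n\n{transcript}"
    ),
    "internal": (
        "Summarize this incident for responding engineers. Keep service names, "
        "error signatures, and the current working hypothesis.\n\n{transcript}"
    ),
    "executive": (
        "Summarize this incident for leadership in three sentences: impact, "
        "action in flight, and expected time to mitigation.\n\n{transcript}"
    ),
}

def build_summary_prompt(audience: str, transcript: str) -> str:
    """Shape the same incident transcript differently depending on who needs the update."""
    return AUDIENCE_PROMPTS[audience].format(transcript=transcript)

# The resulting prompt is then sent to whichever model sits on top of the
# internal knowledge base and incident tooling.
```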
We're even looking at automating status page posts — telling the model: for this incident, pull these metrics, format them this way, push them to the status page in near real time. Because the moment something breaks, your customers go straight to your status page. They want to know whether you've detected it before they did. If they see you have, a huge portion of escalations and support tickets just disappear. That's where I think AI solves a real problem instead of being built for its own sake.
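As a sketch of that near-real-time status page flow, the function below formats a couple of live metrics into a customer-facing post and publishes it. The status page endpoint, payload shape, and metric inputs are placeholders, not any particular provider's API.

```python
import requests

# Hypothetical status page API; substitute your provider's endpoint and schema.
STATUS_PAGE_URL = "https://status.example.com/api/v1/incidents"

def post_status_update(incident_id: str, error_rate: float, regions: list, token: str) -> None:
    """Turn live metrics into a customer-facing status post and publish it.

    The wording stays at the customer-experience level; technical detail
    belongs in the internal summary, not on the public page.
    """
    message = (
        f"We are investigating elevated error rates ({error_rate:.1%}) "
        f"affecting {', '.join(regions)}. Mitigation is in progress; "
        "next update within 30 minutes."
    )
    resp = requests.post(
        f"{STATUS_PAGE_URL}/{incident_id}/updates",
        headers={"Authorization": f"Bearer {token}"},
        json={"status": "investigating", "body": message},
        timeout=10,
    )
    resp.raise_for_status()

# post_status_update("inc-1234", 0.12, ["us-east"], token)
```

Posting something like this within the golden hour is what keeps customers on the status page instead of in the support queue.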
Tell us about your research paper.
I'm working on it with a former colleague who's now a professor at Kansas State University. It's focused on the psychology of human beings in high-stress situations — what happens to us, and how we manage it. This matters because we're in an era of AI where there's a tool for every problem. But the constant across all of it is the human and their emotions. So this paper is a reminder to keep that in mind as the tooling gets better. People can find me on LinkedIn — happy to connect.