
December 3, 2025

5 mins

The Hidden Costs of Immature Incident Management

The start of a journey towards a mature SRE practice.

Written by
Chris Inch

Chris Inch is an engineering leader with 20+ years of experience leading teams at scale at companies like Shopify and Wealthsimple (a Rootly customer).

What time is it?

It was 2 a.m. and my phone was lighting up with notifications. Or maybe it was 3 a.m.? With eyes barely open, I saw an alert that transactions in our production environment had just cratered. In a sleepy yet panicked state, I took a look at prod to see what was happening. After a few seconds of looking at the errors and allowing my brain to catch up, I realized what had happened. The clocks had changed for daylight saving time, and the monitors our team had added appeared to think our traffic had plummeted over the last five minutes. In fact, everything was OK. It was merely the time change that caused the monitors to fire. Back to bed.

This happened while I was working at an early-stage startup. Our development team was quite used to this sort of thing. We were scrappy, agile, and nimble. That’s great when you’re starting your journey. But it became exhausting over time, and we ended up paying the price.

We were always chasing the next customer, implementing the next feature, and catering to the ever-shifting needs of the business. We never had time to improve our incident management, or to add systems and automations that would not only create much-needed reliability for our software, but also protect the humans who were putting out fires left, right, and center.

Case in point: that daylight saving false alarm went unfixed for years because we had too many other things to work on, and we knew it would only occur twice a year.
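I won't reproduce our actual monitor here, but one common way to end up with this kind of false alarm is to compare naive local timestamps across a clock change. The sketch below is a hypothetical reconstruction, not the code we ran: the function names, the five-minute window, and the UTC-5 to UTC-4 shift are all assumptions made for illustration. It shows a "recent traffic" count dropping to zero right after the clocks move forward, even though events are still arriving, and the same count done entirely in UTC.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical setup: events are stored in UTC, and the server's wall clock
# has just moved from standard time (UTC-5) to daylight time (UTC-4).
STANDARD = timezone(timedelta(hours=-5))
DAYLIGHT = timezone(timedelta(hours=-4))
WINDOW = timedelta(minutes=5)


def count_recent_buggy(events_utc, wall_clock_now):
    """Count events in the last 5 minutes by comparing naive local times.

    The bug: event timestamps are converted with a hard-coded standard-time
    offset, while "now" comes from the wall clock, which has already shifted.
    """
    now_naive = wall_clock_now.replace(tzinfo=None)
    count = 0
    for event in events_utc:
        event_naive = event.astimezone(STANDARD).replace(tzinfo=None)
        if now_naive - WINDOW <= event_naive <= now_naive:
            count += 1
    return count


def count_recent_utc(events_utc, wall_clock_now):
    """Same count, but both sides are compared as aware UTC datetimes."""
    now_utc = wall_clock_now.astimezone(timezone.utc)
    return sum(1 for event in events_utc if now_utc - WINDOW <= event <= now_utc)


# One event per minute for the ten minutes after the clocks move forward
# (07:00 UTC is 02:00 at UTC-5, which becomes 03:00 at UTC-4).
switch_utc = datetime(2025, 3, 9, 7, 0, tzinfo=timezone.utc)
events = [switch_utc + timedelta(minutes=m) for m in range(10)]
now = (switch_utc + timedelta(minutes=10)).astimezone(DAYLIGHT)

buggy = count_recent_buggy(events, now)
correct = count_recent_utc(events, now)
print(buggy, correct)  # 0 5
```

The buggy version sees zero events because every event appears to be an hour older than it really is, which is exactly the "traffic just cratered" picture. Keeping every comparison in UTC sidesteps the problem, and it is the kind of small fix that never made it to the top of our list.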

Growing up as a company

This type of situation is quite common in smaller tech companies. It’s the correct decision to find your product-market fit quickly and avoid undue process early on. Innovation, speed, disruption, and agility are your main focus if your startup is in the very early stages.

If you’re lucky enough to secure funding or bring on a few key customers, your company’s growth will accelerate: you will bring on more developers and ship more frequently. This will likely also increase your software’s complexity, and eventually you will reach a tipping point where serious attention to incident management becomes necessary.

Let’s think of this journey as a treadmill:

  • In my experience at a startup, the early days were exciting. Things were starting to move, and building software was exploratory and loose. We would dream up ideas on whiteboards and sometimes code them up the same day.
  • As time passed, the team grew, new customers were signed, and it started to feel like the treadmill was moving faster. There were more things to consider, our systems had to be reliable, and we still wanted to code up cool features.
  • Eventually, the treadmill was moving so fast that problems started to occur. New code would cause an incident, and we would have to react. A software patch would be created, tested, and applied. However, we still wouldn’t slow down to have a retrospective, or address the contributing factors that led to the incident.

Eyes forward. The treadmill keeps moving.

The vicious cycle of recurrence and unfixed problems

There are some aspects of life that can be fixed quickly, allowing us to move forward without ever looking back. I remember falling off my bicycle when I was younger, and scraping my knee. This was fixed with a band-aid and a popsicle, and moments later, I was back out cycling with my friends.

However, software problems are much different and often require much more than a “band-aid fix.” Recurring issues are a huge drain on software development teams.

If the same incident occurs a few times a year, or even more frequently, teams will quickly feel deflated, exhausted, and unmotivated. Sometimes this can even lead to apathy, or to the assumption that a small, familiar incident is recurring when in reality a larger incident is at hand.

For example, at the beginning of this article, I described a situation where a time change caused a monitor to fire in the middle of the night. The team knew it would happen when the clocks adjusted for daylight saving time, and everyone expected to be alerted at 2 a.m. But it would be very dangerous to assume that every alert during a time change was a false alarm.

Unfixed problems can recur in nastier ways, potentially leading to higher-severity incidents, lost development time, and significant stress for the humans responsible for fixing the issues.

Oh, and that’s the other issue with an immature process: human wellness.

The "Hero" anti-pattern

We love a good hero. Action movies, comic books, and even stories in the news glorify and celebrate heroes. We do it in tech as well. At small companies, key individuals quickly emerge as the “go-to” people.

Heroes have great context in the codebase, skills in a specific technology, and great instincts when it comes to uncovering issues. These individuals are praised when they swoop in to save the day, and when they pull an all-nighter to re-architect a service or system. They are, for all intents and purposes, heroes.

However, in the tech industry, creating and celebrating heroes is an anti-pattern that can come with long-term downsides. Even if your go-to people are the most skilled, most selfless folks in the entire company, they will never be immune to exhaustion, nor will they always be available.

Heroes need vacations and time to disconnect, and (spoiler alert) they will eventually leave the company entirely. Not only that, but heroes can stifle the growth of other team members. When it is assumed that the hero will always handle a problem, motivation for others to learn or step into a situation drops off dramatically.

It is often a better idea to democratize the information and skills that heroes possess and remove the dependency on a single individual. It’s never too early to start doing this. Typically, you get more people involved through on-call rotations, shadowing, and a formal incident management process. These are all things we will visit in future parts of this article series.

Start planning for tomorrow yesterday

If any of this article feels even a bit familiar to you, then you’re probably already thinking about how to plan ahead. Many companies of varying sizes face very similar issues: tech debt, poor planning, a brittle platform, recurring issues, and humans who are burnt out. In the next couple of articles, I will dive into adding structure, making improvements to tackle problems proactively, leveling up your entire development team, and fixing bugs once and for all.