

July 1, 2025
6 mins
This article explores why teams should move beyond simplistic metrics and focus on qualitative assessments to strengthen their resilience.
Despite recurring critiques of Mean Time to Resolution (MTTR) as a reliability metric, the industry continues to treat it as a default performance target. The issue is not merely that MTTR is statistically imperfect; it’s that it’s often used in fundamentally misleading ways.
A common fallacy is assuming that one team’s MTTx values can be benchmarked against another’s, when operational environments differ radically. If no two incidents are the same, two entire companies are far from comparable in such simplistic terms. One team might find a 30-minute mean time to acknowledge acceptable, while another optimizes aggressively to reduce that figure to under five minutes. Both can be making rational decisions based on the resilience of their environment, their customer base, and even the specific incident being responded to.
To actually enhance incident response, teams should focus on qualitative observations rather than external benchmarks. Examine how real-world incidents unfold. Are responders able to act efficiently? Do existing workflows streamline or obstruct resolution efforts? Do delays in response or resolution actually have adverse effects on the business or stakeholders? These contextual insights reveal far more about what needs to improve than numerical targets ever could.
With that in mind, the following qualitative indicators serve as practical goals for teams aiming to strengthen incident management:
If issues resolve without action or with superficial fixes, your alerting might be too sensitive or your team might be lacking in post-incident learning.
Alerting should never be set up as an FYI that something could happen; only alert on things that are actively happening or have already occurred.
Likewise, if an alert is notifying a team to perform a "quick fix," post-incident learning should tell us it is time to automate the solution or eliminate the root cause.
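If your tooling records whether an alert required human intervention, a periodic audit can surface the FYI-style alerts worth deleting or automating. A minimal sketch, assuming a hypothetical export of alert history; the field names are placeholders for whatever your monitoring or paging system actually provides:

```python
from collections import Counter

# Hypothetical export of recent alert history; in practice this would come
# from your monitoring or paging system's API.
alerts = [
    {"name": "api-latency-p99", "resolved_without_action": True},
    {"name": "api-latency-p99", "resolved_without_action": True},
    {"name": "disk-usage-forecast", "resolved_without_action": True},
    {"name": "checkout-error-rate", "resolved_without_action": False},
]

# Count how often each alert fired, and how often it self-resolved.
fired = Counter(a["name"] for a in alerts)
noise = Counter(a["name"] for a in alerts if a["resolved_without_action"])

for name in fired:
    ratio = noise[name] / fired[name]
    if ratio > 0.5:  # arbitrary threshold; tune to your environment
        print(f"{name}: {ratio:.0%} of firings needed no action -- review or remove")
```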
Communication platforms like Slack should show no ambiguity about the declaration process. Phrases like "Who’s on call?" or "Can someone declare this?" suggest a systemic weakness.
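One low-tech check is to scan exported channel messages for hesitation phrases. The sketch below is purely illustrative: the messages and the phrases worth flagging are hypothetical stand-ins for whatever patterns show up in your own channels.

```python
# Hypothetical scan of exported incident-channel messages for phrases that
# signal ambiguity about who declares or leads an incident.
HESITATION_PHRASES = ["who's on call", "can someone declare", "should we page"]

messages = [
    "who's on call for payments?",
    "deploy finished, watching dashboards",
    "can someone declare this as a sev2?",
]

flagged = [m for m in messages
           if any(p in m.lower() for p in HESITATION_PHRASES)]
print(f"{len(flagged)} of {len(messages)} messages suggest declaration ambiguity")
```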
What qualifies as "reasonable" will vary, but patterns of neglected alerts should prompt a deeper review of alert routing and prioritization. Assigning alert priorities deliberately, and collecting data on how they are handled, allows you to set expectations that are fair.
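To ground those expectations in data, one option is to summarize historical acknowledgment times per priority level. A minimal sketch, assuming a hypothetical export of (priority, minutes-to-acknowledge) records; the numbers are illustrative:

```python
import statistics
from collections import defaultdict

# Hypothetical history of (priority, minutes-to-acknowledge) pairs,
# e.g. pulled from your paging system.
history = [("P1", 3), ("P1", 7), ("P1", 4), ("P2", 18), ("P2", 25), ("P3", 90)]

by_priority = defaultdict(list)
for priority, minutes in history:
    by_priority[priority].append(minutes)

# Use observed medians as a starting point for fair, per-priority expectations.
for priority, times in sorted(by_priority.items()):
    print(f"{priority}: median ack {statistics.median(times)} min "
          f"over {len(times)} alerts")
```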
Leadership clarity eliminates confusion about decision-making authority. Uncertainty here creates operational drag.
The person guiding the response should not be writing code or querying logs. Their role is orchestration, not execution.
Long silences during response calls often signal a lack of direction. Natural pauses happen, but they are precisely when strong leadership should reassert focus.
While some responders are working on more time-intensive tasks, find a valuable way to use everyone else's time. Explore next steps, validate findings and direction, or use the time as a learning opportunity for knowledge sharing about the services involved.
The effectiveness of external incident communications should be evaluated based on feedback from those managing customer relationships. Their insights are a real-time indicator of whether your messages hit the mark.
Treat RCAs as part of your external narrative. They should meet the same quality standards as all other communications.
Ensure the information being shared is relevant and appropriate to the targeted stakeholder group. Aim to keep communications consistent where possible, but be clear: one size does not fit all.
While "significant" will differ across teams, any event you would describe that way should prompt postmortem analysis.
A robust culture of learning can only be achieved when the team is motivated to find learning scenarios everywhere they look. It can be tough to find the "what went well" in severe incidents, so use minor incidents to highlight large successes alongside the small improvements still needed.
High engagement from uninvolved team members and internal circulation of findings are signs of a healthy learning environment.
Repetition of similar failure modes suggests postmortem actions are either shallow or ignored. Each incident should yield fresh insights, not merely echo past failures.
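One lightweight way to detect that repetition is to tag each postmortem with a failure mode and flag tags that recur. A sketch, assuming hypothetical postmortem records with a free-form failure_mode tag:

```python
from collections import Counter

# Hypothetical postmortem records tagged with a failure mode.
postmortems = [
    {"id": "PM-101", "failure_mode": "config-rollout"},
    {"id": "PM-108", "failure_mode": "config-rollout"},
    {"id": "PM-113", "failure_mode": "db-failover"},
    {"id": "PM-121", "failure_mode": "config-rollout"},
]

# Any failure mode appearing more than once deserves a closer look.
counts = Counter(pm["failure_mode"] for pm in postmortems)
for mode, n in counts.most_common():
    if n >= 2:
        print(f"{mode}: {n} incidents -- action items may be shallow or unimplemented")
```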
Qualitative assessment demands more time and attention than watching a metrics dashboard, but it pays off. Metrics alone often obscure the nuances of operational health. For example, reducing incident frequency can paradoxically increase MTTR by leaving only the rare, complex cases. Without a qualitative lens, improvements can be misinterpreted as regressions.
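A toy calculation makes that paradox concrete. The numbers below are invented: a quarter with many routine incidents and a few complex ones, followed by the same quarter with the routine class engineered away entirely.

```python
import statistics

# Illustrative resolution times in minutes: many routine incidents plus a few complex ones.
before = [10, 12, 15, 14, 11, 13, 240, 300]   # routine + complex
after = [240, 300]                            # routine class engineered away

print(f"MTTR before: {statistics.mean(before):.0f} min over {len(before)} incidents")
print(f"MTTR after:  {statistics.mean(after):.0f} min over {len(after)} incidents")
# Fewer incidents and less customer pain, yet the headline MTTR more than tripled.
```

The team shipped a real improvement, but a dashboard watching MTTR alone would report a regression.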
When grounded in real-world observations, qualitative signals provide deeper visibility into cultural and procedural weak points. Instead of letting charts run your company, treat metrics as support tools, and let the people and culture of your organization dictate the path forward.