

July 1, 2025
6 mins
This article explores why teams should move beyond simplistic metrics and focus on qualitative assessments to strengthen their resilience.
Despite recurring critiques of Mean Time to Resolution (MTTR) as a reliability metric, the industry continues to treat it as a default performance target. The issue is not merely that MTTR is statistically imperfect; it’s that it’s often used in fundamentally misleading ways.
A common fallacy is assuming that one team’s MTTx values can be benchmarked against another’s, when operational environments differ radically. If no two incidents are the same, two entire companies are far from comparable in such simplistic terms. One team might find a 30-minute mean time to acknowledge acceptable, while another optimizes aggressively to reduce that figure to under five minutes. Both can be making rational decisions based on the resilience of their environment, their customer base, and even the specific incident being responded to.
To actually enhance incident response, teams should focus on qualitative observations rather than external benchmarks. Examine how real-world incidents unfold. Are responders able to act efficiently? Do existing workflows streamline or obstruct resolution efforts? Do delays in response or resolution actually have adverse effects on the business or stakeholders? These contextual insights reveal far more about what needs to improve than numerical targets ever could.
With that in mind, the following qualitative indicators serve as practical goals for teams aiming to strengthen incident management:
If issues resolve without action or with superficial fixes, your alerting might be too sensitive or your team might be lacking in post-incident learning.
Alerting should never be set up as an FYI that something could happen; only alert on things that are actively happening or have already occurred.
Likewise, if an alert is notifying a team to perform a "quick fix," post-incident learning should tell us it is time to automate the solution or eliminate the root cause.
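If your tooling records whether an alert required human intervention, a periodic audit can surface the FYI-style alerts worth deleting or automating. A minimal sketch, assuming a hypothetical export of alert history; the field names are placeholders for whatever your monitoring or paging system actually provides:

```python
from collections import Counter

# Hypothetical export of recent alert history; in practice this would come
# from your monitoring or paging system's API.
alerts = [
    {"name": "api-latency-p99", "resolved_without_action": True},
    {"name": "api-latency-p99", "resolved_without_action": True},
    {"name": "disk-usage-forecast", "resolved_without_action": True},
    {"name": "checkout-error-rate", "resolved_without_action": False},
]

# Count how often each alert fired, and how often it self-resolved.
fired = Counter(a["name"] for a in alerts)
noise = Counter(a["name"] for a in alerts if a["resolved_without_action"])

for name in fired:
    ratio = noise[name] / fired[name]
    if ratio > 0.5:  # arbitrary threshold; tune to your environment
        print(f"{name}: {ratio:.0%} of firings needed no action -- review or remove")
```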
Communication platforms like Slack should show no ambiguity about the declaration process. Phrases like "Who’s on call?" or "Can someone declare this?" suggest a systemic weakness.
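One low-tech check is to scan exported channel messages for hesitation phrases. The sketch below is purely illustrative: the messages and the phrases worth flagging are hypothetical stand-ins for whatever patterns show up in your own channels.

```python
# Hypothetical scan of exported incident-channel messages for phrases that
# signal ambiguity about who declares or leads an incident.
HESITATION_PHRASES = ["who's on call", "can someone declare", "should we page"]

messages = [
    "who's on call for payments?",
    "deploy finished, watching dashboards",
    "can someone declare this as a sev2?",
]

flagged = [m for m in messages
           if any(p in m.lower() for p in HESITATION_PHRASES)]
print(f"{len(flagged)} of {len(messages)} messages suggest declaration ambiguity")
```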
What qualifies as "reasonable" will vary, but patterns of neglected alerts should prompt a deeper review of alert routing and prioritization. Assigning alert priorities deliberately, and collecting data on how they are handled, allows you to set expectations that are fair.
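To ground those expectations in data, one option is to summarize historical acknowledgment times per priority level. A minimal sketch, assuming a hypothetical export of (priority, minutes-to-acknowledge) records; the numbers are illustrative:

```python
import statistics
from collections import defaultdict

# Hypothetical history of (priority, minutes-to-acknowledge) pairs,
# e.g. pulled from your paging system.
history = [("P1", 3), ("P1", 7), ("P1", 4), ("P2", 18), ("P2", 25), ("P3", 90)]

by_priority = defaultdict(list)
for priority, minutes in history:
    by_priority[priority].append(minutes)

# Use observed medians as a starting point for fair, per-priority expectations.
for priority, times in sorted(by_priority.items()):
    print(f"{priority}: median ack {statistics.median(times)} min "
          f"over {len(times)} alerts")
```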
Leadership clarity eliminates confusion about decision-making authority. Uncertainty here creates operational drag.
The person guiding the response should not be writing code or querying logs. Their role is orchestration, not execution.
Long silences during response calls often signal a lack of direction. Natural pauses happen, but they are precisely when strong leadership should reassert focus.
While some responders are working on more time-intensive tasks, find a valuable way to use everyone else's time. Explore next steps, validate findings and direction, or use the time as a learning opportunity for knowledge sharing about the services involved.
The effectiveness of external incident communications should be evaluated based on feedback from those managing customer relationships. Their insights are a real-time indicator of whether your messages hit the mark.
Treat RCAs as part of your external narrative. They should meet the same quality standards as all other communications.
Ensure the information being shared is relevant and appropriate to the targeted stakeholder group. Aim to keep communications consistent where possible, but be clear: one size does not fit all.
While "significant" will differ across teams, any event you would describe that way should prompt postmortem analysis.
A robust culture of learning can only be achieved when the team is motivated to find learning scenarios everywhere they look. It can be tough to find the "what went well" in severe incidents, so use minor incidents to highlight large successes alongside the small improvements still needed.
High engagement from uninvolved team members and internal circulation of findings are signs of a healthy learning environment.
Repetition of similar failure modes suggests postmortem actions are either shallow or ignored. Each incident should yield fresh insights, not merely echo past failures.
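One lightweight way to detect that repetition is to tag each postmortem with a failure mode and flag tags that recur. A sketch, assuming hypothetical postmortem records with a free-form failure_mode tag:

```python
from collections import Counter

# Hypothetical postmortem records tagged with a failure mode.
postmortems = [
    {"id": "PM-101", "failure_mode": "config-rollout"},
    {"id": "PM-108", "failure_mode": "config-rollout"},
    {"id": "PM-113", "failure_mode": "db-failover"},
    {"id": "PM-121", "failure_mode": "config-rollout"},
]

# Any failure mode appearing more than once deserves a closer look.
counts = Counter(pm["failure_mode"] for pm in postmortems)
for mode, n in counts.most_common():
    if n >= 2:
        print(f"{mode}: {n} incidents -- action items may be shallow or unimplemented")
```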
Qualitative assessment demands more time and attention than watching a metrics dashboard, but it pays off. Metrics alone often obscure the nuances of operational health. For example, reducing incident frequency can paradoxically increase MTTR by leaving only the rare, complex cases. Without a qualitative lens, improvements can be misinterpreted as regressions.
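A toy calculation makes that paradox concrete. The numbers below are invented: a quarter with many routine incidents and a few complex ones, followed by the same quarter with the routine class engineered away entirely.

```python
import statistics

# Illustrative resolution times in minutes: many routine incidents plus a few complex ones.
before = [10, 12, 15, 14, 11, 13, 240, 300]   # routine + complex
after = [240, 300]                            # routine class engineered away

print(f"MTTR before: {statistics.mean(before):.0f} min over {len(before)} incidents")
print(f"MTTR after:  {statistics.mean(after):.0f} min over {len(after)} incidents")
# Fewer incidents and less customer pain, yet the headline MTTR more than tripled.
```

The team shipped a real improvement, but a dashboard watching MTTR alone would report a regression.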
When grounded in real-world observations, qualitative signals provide deeper visibility into cultural and procedural weak points. Instead of letting charts run your company, treat metrics as support tools, and let the people and culture of your organization dictate the path forward.