

Strategies from SRE leaders fighting noisy alerts in complex systems.
It’s annoying to get too many alerts. It’s also annoying not to get the one alert that could’ve kept an incident from going really bad. Getting your system to wake you up at 3 a.m. exactly when it’s necessary is a long-term effort: part art, part process, and a little bit of guesswork.
In our latest Rootly roundtable, we sat down with a group of seasoned SREs (collectively packing over 100 years of ops scars) to trade notes on what makes an alert useful, what makes it noise, and how to build alerting systems that teams can trust.
Here are their top strategies distilled for you:
Building an effective alerting strategy isn’t just about what triggers an alert. It’s about deciding when you trust that alert enough to act on it. One approach that teams are successfully using is to split alerts into two streams: low confidence and high confidence.
This separation helps ensure that only trusted, action-worthy alerts can escalate to on-call engineers, while newer or less certain alerts are observed and refined without causing unnecessary noise. The system creates a healthy buffer between signal discovery and operational disruption, allowing teams to tune alerts without getting overwhelmed.
Every new alert starts in the low confidence stream. This stream is essentially a staging area, useful for catching anomalies like a spike in error rates or unusual behavior that might warrant monitoring, but not immediate escalation. The alert is configured to notify via a lightweight channel, such as a Slack message, rather than triggering any paging system. This way, engineers can keep an eye on the situation, evaluate whether the alert is firing appropriately, and make adjustments as needed.
Importantly, low confidence alerts are excluded from overnight paging. The goal is to avoid waking up engineers for something that hasn’t yet been proven to reflect a real or urgent problem. It’s a safe space for calibration and triage, not for reactive firefighting.
Once an alert has been refined and shown to consistently signal something critical, like a service outage or high-severity issue, it is manually promoted to the high confidence stream. This step is intentional and gated to ensure only high-quality, trustworthy alerts are allowed to page someone at 3 a.m.
The high confidence stream represents operational trust. Alerts here are expected to fire only in true problem scenarios, and they’re directly connected to paging policies. This prevents noisy or overly sensitive alerts from slipping through and protects the on-call experience from unnecessary disruptions.
By explicitly separating alerts into confidence levels and requiring manual promotion, teams maintain clarity and control over their alerting pipelines, building reliability not just into systems, but into the workflows around them.
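To make the idea concrete, here’s a minimal sketch of what routing by confidence stream could look like in code. The `confidence` label, the Slack webhook, and the paging hook are all assumptions for illustration, not a prescription for any particular stack.

```python
import json
import urllib.request

# Hypothetical routing sketch: every alert carries a "confidence" label.
# Low-confidence alerts go to a Slack channel for observation; only alerts
# manually promoted to "high" are allowed to page the on-call engineer.

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE"  # assumed placeholder


def route_alert(alert: dict) -> str:
    """Decide where an alert goes based on its confidence stream."""
    confidence = alert.get("labels", {}).get("confidence", "low")

    if confidence == "high":
        page_on_call(alert)   # wired to the real paging policy
        return "paged"

    notify_slack(alert)       # lightweight channel, never pages
    return "slack"


def notify_slack(alert: dict) -> None:
    payload = {"text": f":eyes: low-confidence alert: {alert['name']}"}
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)


def page_on_call(alert: dict) -> None:
    # Placeholder for the real paging integration (PagerDuty, Rootly, etc.).
    print(f"PAGE: {alert['name']}")
```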
An alerting strategy becomes far more impactful when it’s rooted in customer experience rather than internal system metrics. Shifting the engineering mindset from infrastructure-first thinking to customer-first outcomes is a powerful way to align alerting with what actually matters to the business. This means moving away from narrowly scoped indicators like CPU or memory utilization, and instead focusing on how well the user-facing parts of the system are performing. The shift may sound subtle, but the impact is significant.
A common example of this approach is favoring alerts on public API response latencies rather than on internal resource usage. Instead of tracking whether a specific container is using too much CPU, teams are encouraged to ask: Is our API responding within the expected time for a customer? That’s what truly reflects service quality. It marks a deliberate effort to pull engineers away from internal system health metrics and toward externally visible performance.
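As a toy illustration of alerting on what the customer actually experiences, the sketch below evaluates p95 latency for a public API against a target instead of watching host CPU. The 500 ms target and the way samples are collected are assumptions.

```python
# Hypothetical check: alert on the customer-facing signal (API latency)
# rather than on internal resource usage.

P95_TARGET_MS = 500  # assumed SLO target, not a recommendation


def p95(values):
    """95th-percentile latency from a window of samples."""
    ordered = sorted(values)
    index = max(int(round(0.95 * len(ordered))) - 1, 0)
    return ordered[index]


def should_alert(latencies_ms):
    """True when the customer-visible latency budget is being blown."""
    return p95(latencies_ms) > P95_TARGET_MS


# Example: a window of recent public API response times in milliseconds.
window = [120, 180, 240, 310, 95, 870, 150, 200, 640, 130]
print(should_alert(window))  # True only if p95 exceeds the 500 ms target
```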
For teams building complex, asynchronous systems, the challenge is even more nuanced. These products often rely on workflows where a client triggers an intent, and work is completed at some point in the future. Modeling alerts around these asynchronous experiences requires a deep understanding of the product and how users interact with it. In such cases, ensuring customer-centricity becomes a guiding principle in the design of alerts.
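One hedged way to express that in code is to alert on how long accepted work has been waiting to complete, rather than on any single request’s response time. The data shape and the 15-minute budget below are purely illustrative.

```python
import time

# Illustrative only: for async workflows, alert on how long accepted work
# has been waiting to complete, not on per-request latency.

MAX_PENDING_AGE_S = 15 * 60  # assumed budget: work should finish within 15 minutes


def oldest_pending_age(pending_intents, now=None):
    """Age in seconds of the oldest intent that hasn't completed yet."""
    now = now or time.time()
    if not pending_intents:
        return 0.0
    return now - min(intent["accepted_at"] for intent in pending_intents)


def should_alert(pending_intents):
    return oldest_pending_age(pending_intents) > MAX_PENDING_AGE_S
```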
Alerts shouldn’t be permanent. One SRE leader follows three simple principles to keep their alerting clean: alerts must be actionable, relevant, and reviewed regularly. If an alert doesn’t drive a clear response, it shouldn’t wake anyone up. If it doesn’t reflect a real customer impact, it’s probably not worth keeping. And even good alerts can go stale, so regular reviews, whether weekly or monthly, are essential to avoid clutter and keep the signal strong.
Pinning down naughty alerts requires shared understanding, frequent inspection, and intentional upkeep. Teams that treat alerting as a team sport tend to build stronger signal coverage, reduce noise, and respond more confidently to real incidents. Two core practices stand out in this approach: structured game days to validate assumptions, and continuous adjustment as systems evolve.
One successful practice involves setting aside time for scenario-based alert testing, often referred to as Game Days. During these sessions, teams gather to brainstorm three plausible failure scenarios for their service. For each one, they define which alerts should fire, simulate the failure (like network interruptions or service crashes), and then observe what actually happens.
The results are rarely perfect. Sometimes expected alerts don’t fire at all; other times, too many alerts go off. These sessions help teams identify which signals are actually useful and which need tuning or removal. The emphasis is on clarity and actionability: alerts should be easy to notice and tied to a clear response. Over time, the exercise helps engineers sharpen their intuition around what makes a good alert and surfaces blind spots in existing coverage.
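A lightweight way to keep score during a game day is to diff the alerts you expected against the ones that actually fired. The sketch below assumes both are recorded as simple sets of alert names; the scenario and names are made up.

```python
# Minimal game-day scorecard: compare expected alerts against what fired.
# The scenario and alert names are invented for illustration.

def game_day_report(expected: set[str], fired: set[str]) -> dict:
    return {
        "missed": expected - fired,      # should have fired but stayed silent
        "unexpected": fired - expected,  # fired without being predicted (possible noise)
        "confirmed": expected & fired,   # behaved as designed
    }


scenario = "primary database failover"
expected = {"api_error_rate_high", "db_replica_lag_high"}
fired = {"api_error_rate_high", "node_cpu_high", "disk_io_saturated"}

print(scenario, game_day_report(expected, fired))
```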
Beyond game days, teams keep alerts healthy by integrating reviews into natural workflows. Postmortems are a key moment to reflect on whether alerts fired appropriately or failed to detect root causes. If new failure modes are discovered, new alerts may be added but only if they provide clear value.
Alerts are also revisited during architectural changes. Rather than treating this as a checkbox, teams consider whether existing metrics and thresholds still apply. This might result in a quick adjustment or a more significant overhaul, depending on the scope of change. Keeping alerts aligned with evolving systems helps avoid stale signals and prevents unnecessary noise.
Ultimately, this ongoing review process helps teams avoid the trap of alert bloat. Rather than defaulting to vendor presets or accumulating alerts over time, teams make deliberate decisions about what stays, what goes, and what needs to change.
As time passes, you’ll be paged by noisy alerts countless times. One SRE leader explained that their team has invested in making the most of the wealth of information they have from previous alerts and incidents. The team is building internal tooling that analyzes alerts and creates feedback loops to improve alert quality over time. Over the last quarter, this approach led to a roughly 40% reduction in overall alert volume.
The team began by identifying key performance indicators that could serve as proxies for noise. These include alert frequency, auto-resolve rate, acknowledgment rate, mean time to resolve, and the distribution of alert occurrences over a rolling window. For example, if an alert fires across 80% of days in a 30-day period, resolves itself quickly, and is rarely acknowledged, it’s likely not providing meaningful signal. The pattern suggests something “flappy” or routinely ignorable.
By treating these KPIs as leading indicators of low-quality alerts, the team developed internal tools to track them across all alerts in the system. This data is surfaced back to engineering teams in weekly operations reviews, prompting them to reassess alerts that meet noise-prone patterns.
To help engineers act on these insights, the team developed an “alert quality index”, a composite score calculated by combining the proxy metrics into a single value between 0 and 1. A perfect alert would score closer to 1, while noisy, low-value alerts trend toward 0. This index gives teams a straightforward starting point to identify their worst offenders.
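The exact formula wasn’t shared, but a plausible sketch of such an index is a weighted blend of the proxy metrics, normalized so 1 reads as healthy and 0 as noisy. The weights and normalization below are assumptions, not the team’s actual scoring.

```python
# Hypothetical "alert quality index": combine noise-proxy metrics into a
# single 0-1 score. Weights and normalization are illustrative guesses.

def alert_quality_index(
    days_fired_fraction: float,    # e.g. fired on 0.8 of days in a 30-day window
    auto_resolve_rate: float,      # fraction of firings that resolved themselves
    ack_rate: float,               # fraction of firings a human acknowledged
    mean_time_to_resolve_min: float,
) -> float:
    # Penalize alerts that fire constantly, self-resolve, and get ignored.
    frequency_score = 1.0 - min(days_fired_fraction, 1.0)
    self_heal_score = 1.0 - min(auto_resolve_rate, 1.0)
    engagement_score = min(ack_rate, 1.0)
    # Very fast resolutions with no acknowledgment tend to indicate flappiness.
    ttr_score = min(mean_time_to_resolve_min / 30.0, 1.0)

    weights = (0.3, 0.25, 0.3, 0.15)
    scores = (frequency_score, self_heal_score, engagement_score, ttr_score)
    return sum(w * s for w, s in zip(weights, scores))


# The "flappy" pattern from above: fires most days, self-resolves, rarely acked.
print(round(alert_quality_index(0.8, 0.9, 0.05, 4), 2))  # low score, close to 0
```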
Visibility into these metrics is embedded into daily workflows. Alert data is extracted from their incident platform (originally in PagerDuty, now migrated to Rootly), transformed via Python scripts, and exposed as Prometheus metrics. Dashboards let each team drill into their own top noisy alerts.
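The scripts themselves weren’t shown, but the export step might look roughly like this with the prometheus_client library; the metric name, labels, and port are assumptions.

```python
import time

from prometheus_client import Gauge, start_http_server

# Sketch of the export step: take per-alert noise metrics (already extracted
# and transformed elsewhere) and expose them for Prometheus to scrape.
# Metric and label names here are assumptions, not the team's actual schema.

ALERT_QUALITY = Gauge(
    "alert_quality_index",
    "Composite 0-1 quality score per alert",
    ["team", "alert_name"],
)


def publish(scores):
    """scores: iterable of (team, alert_name, quality_index) rows."""
    for team, alert_name, quality in scores:
        ALERT_QUALITY.labels(team=team, alert_name=alert_name).set(quality)


if __name__ == "__main__":
    start_http_server(9108)  # arbitrary scrape port for this example
    while True:
        publish([("payments", "api_error_rate_high", 0.82),
                 ("payments", "node_cpu_high", 0.11)])
        time.sleep(60)
```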
Alongside this, teams are supported through programmatically generated Jira stories that target their worst alerts, as well as a structured weekly review checklist that includes reflecting on alert quality using the dashboards. This two-pronged approach, automated nudges plus process integration, has been key to driving adoption and reducing alert fatigue.
While proxy metrics offer strong signals, they can’t fully replace human judgment. The team has started exploring ways to directly measure precision (how many alerts were actually useful) and recall (how many critical conditions were correctly caught by alerts). To compute precision, they rely on humans tagging alerts as useful or not, though engagement is a hurdle.
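In those terms, precision and recall are straightforward to compute once the labels exist. The sketch below assumes each fired alert carries a human “useful” tag and that critical conditions are tracked separately with a flag for whether an alert caught them.

```python
# Illustrative precision/recall for alerting, assuming human-tagged data:
#   precision = useful alerts / all alerts that fired
#   recall    = critical conditions caught by an alert / all critical conditions

def alert_precision(fired_alerts):
    """fired_alerts: list of dicts with a boolean 'useful' tag from engineers."""
    if not fired_alerts:
        return None
    useful = sum(1 for a in fired_alerts if a.get("useful"))
    return useful / len(fired_alerts)


def alert_recall(critical_conditions):
    """critical_conditions: list of dicts with a boolean 'caught_by_alert' flag."""
    if not critical_conditions:
        return None
    caught = sum(1 for c in critical_conditions if c.get("caught_by_alert"))
    return caught / len(critical_conditions)
```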
To encourage more consistent feedback, they’re experimenting with embedding signal/noise tagging into the alert resolution flow. Rootly now allows engineers to flag alerts as signal or noise. The hope is to eventually gate alert closure behind filling out a minimal metadata field, making it easier to gather human-labeled data at scale.
In the meantime, proxy signals, like how quickly an alert is acknowledged or whether it’s ignored, continue to guide assumptions. When humans don’t engage with alerts promptly, it’s often a sign they’re not viewed as critical. By blending direct tagging with intelligent heuristics, the team is building a robust system that continuously refines itself based on both observed patterns and real human context.
Letting developers feel the pain of their own alerts is one of the fastest ways to improve signal quality. When they’re on the receiving end, especially for alerts they want escalated as P1 or P2, they quickly see whether those alerts are actually actionable or just noise. This firsthand experience drives sharper decisions about what should page and what should stay in a dashboard.
Before any alert goes to the central incident team, developers are asked to take the page themselves. If it fires 30 times a week, they’ll feel the pain and think twice. They’re pushed to ask: Is this truly urgent? Does it trigger a clear action? Often, alerts get tuned or scrapped before escalation even becomes a question.
A central incident manager acts as a quality gate. Teams can create alerts freely, but if they want those alerts tied to incidents or pagers, they go through review. Alerts are tested over a trial period (sometimes just sent to an inbox), and if one fires 73 times in two weeks, it’s flagged. Only alerts that are actionable, documented, and well-scoped make it through. The result: less noise, better signal, and more sustainable on-call rotations.
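A trial period like that takes only a little bookkeeping to enforce. In the sketch below, the two-week window comes from the example above, while the firing threshold and data shape are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta

# Sketch of a trial-period gate: alerts routed to an inbox for two weeks,
# and anything firing past a threshold gets flagged for review before it
# is ever allowed to page.

TRIAL_WINDOW = timedelta(days=14)
FIRING_THRESHOLD = 20  # assumed review threshold; the alert in the example fired 73 times


def flag_noisy_alerts(firings, now=None):
    """firings: iterable of (alert_name, fired_at datetime) from the trial inbox."""
    now = now or datetime.utcnow()
    counts = Counter(
        name for name, fired_at in firings if now - fired_at <= TRIAL_WINDOW
    )
    return {name: n for name, n in counts.items() if n >= FIRING_THRESHOLD}
```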
Across all these strategies, one theme stands out: good alerts don’t happen by accident. Whether it’s separating alerts by confidence level, designing them around real customer impact, or forcing developers to live with their own signals, the most effective teams treat alerting as a discipline, not a checkbox. It requires clear principles, deliberate tooling, and most importantly, ongoing human attention.
The SRE leaders in this roundtable aren’t chasing perfection: they’re building systems that adapt. They review, they revise, and they challenge assumptions constantly. Because in the end, the goal isn’t to have more alerts, it’s to have the right ones.