When a Kubernetes cluster’s health status changes to Degraded, the clock starts ticking. Every second that passes before the right engineers are notified increases the risk of a minor issue cascading into a full-blown outage. Manual communication during these critical moments is too slow, inconsistent, and prone to error. Rapid recovery depends not just on fast detection but on immediate, targeted communication.
By integrating your monitoring stack with an incident management platform like Rootly, you can establish a system for auto-notifying platform teams of degraded clusters. These automated workflows instantly route contextual alerts to the correct on-call engineers, dramatically cutting your Mean Time to Resolution (MTTR) and protecting your services.
The High Cost of Slow, Manual Notifications
A degraded cluster can quickly spiral into breached service-level objectives (SLOs) and a poor user experience. The longer it takes for the platform team to investigate, the more severe the outcome. This delay is often rooted in inefficient notification processes that rely on someone noticing a change on a dashboard and then trying to find the right person in a crowded chat channel.
These inefficiencies are made worse by alert fatigue. When engineers are flooded with low-context, non-actionable alerts, they start to tune them out. This creates a dangerous blind spot: the signal-to-noise ratio degrades until a truly critical alert can be missed entirely. AI-powered platforms can dramatically improve this ratio by adding context and grouping related alerts.
Automated, targeted notifications solve this. Instead of blasting an entire channel, a smart system delivers a high-context alert directly to the on-call engineer for the affected service. By applying smart alert filtering, you ensure that only the most critical issues demand immediate attention, making every notification meaningful.
Laying the Foundation: Monitoring and Alerting Best Practices
Effective alerting requires comprehensive monitoring. You can't automate notifications for issues you don't see. This means tracking the right metrics and configuring alerts that provide clear, actionable information [4].
What to Monitor in Your Kubernetes Clusters
To reliably detect a degraded cluster, you need to track key health indicators across your infrastructure and applications [3]. Focus your monitoring on these critical components:
- Node Status: Watch for nodes entering states like `NotReady`. This is often the first sign of an infrastructure problem and can be tracked in Prometheus using the `kube_node_status_condition{condition="Ready", status="false"}` metric.
- Pod Status: Alert on pods that are stuck in `Pending`, repeatedly failing (`CrashLoopBackOff`), or showing other error statuses. The `kube_pod_status_phase` metric is essential for this.
- Resource Consumption: Monitor CPU, memory, and disk usage at both the node and pod level. Set intelligent thresholds that predict resource saturation before it causes a failure.
- Application Health: Track service-specific metrics like request latency, error rates (for example, HTTP 5xx codes), and queue depths. A spike here is often the first sign of a user-facing problem.
- Control Plane Health: Ensure core Kubernetes components like the API server, scheduler, and etcd are healthy and responsive. Their failure can destabilize the entire cluster.
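As a sketch, the first two checks above can be expressed as Prometheus alerting rules. The rule names, thresholds, and durations below are illustrative choices, not standards; the metrics come from kube-state-metrics:

```yaml
groups:
  - name: kubernetes-health
    rules:
      - alert: NodeNotReady
        # Fires when a node has reported NotReady for 5 minutes straight.
        expr: kube_node_status_condition{condition="Ready", status="false"} == 1
        for: 5m
        labels:
          severity: critical
      - alert: PodCrashLooping
        # Repeated container restarts are the signature of CrashLoopBackOff.
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
```

The `for` clause matters: it suppresses one-off blips so the rule only fires on sustained degradation.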
How to Configure Actionable Alerts
The goal of any alert is to trigger a specific, informed action. Use tools like Prometheus and Alertmanager to define intelligent alerting rules. Each alert must contain essential context: what component is degraded, which service is impacted, its severity, and links to relevant dashboards or logs. For instance, an alert for a degraded ArgoCD application should specify the application's name and the reason for its health status [2].
An actionable alert rule in Prometheus might look like this:
```yaml
- alert: HighErrorRate
  expr: (sum(rate(http_requests_total{status=~"5.*"}[5m])) / sum(rate(http_requests_total[5m]))) > 0.05
  for: 2m
  labels:
    severity: critical
    service: api-gateway
  annotations:
    summary: "High 5xx error rate for {{ $labels.service }}"
    description: "The error rate for {{ $labels.service }} has been above 5% for the last 2 minutes."
```
Note the `service` label. This metadata is crucial for the automated routing that follows.
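This is where labels pay off downstream. A hedged sketch of an Alertmanager routing tree that forwards critical alerts to an incident platform's webhook (the receiver URL is a placeholder, not a real endpoint):

```yaml
route:
  receiver: default
  routes:
    # Send critical alerts straight to the incident management platform.
    - matchers:
        - severity="critical"
      receiver: incident-platform
receivers:
  - name: default
  - name: incident-platform
    webhook_configs:
      # Placeholder; substitute your platform's actual webhook URL.
      - url: https://example.com/webhooks/alertmanager
```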
Building Your Real-Time Notification Workflow
With robust monitoring in place, you can build an automated workflow that connects a detected issue directly to the person who can fix it. This process breaks down into three straightforward steps.
Step 1: Connect Monitoring to Your Incident Hub
Your alerts need a central destination. Whether they originate from Prometheus, Datadog, or a cloud provider's health service like Azure Service Health [7], they should all be routed to an incident management platform like Rootly. This is typically done using webhooks or pre-built integrations. Centralizing alerts allows you to deduplicate, correlate, and process them in a single place before taking action.
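As a minimal sketch of that centralization step, assuming an Alertmanager-style webhook payload, incoming alerts can be deduplicated by fingerprint before any incident is created. The function and field names here are illustrative, not a specific platform's API:

```python
def dedupe_alerts(alerts, seen_fingerprints):
    """Keep only alerts whose fingerprint has not been processed yet.

    `alerts` is a list of dicts shaped like Alertmanager webhook entries,
    each carrying a `fingerprint` that uniquely identifies its label set.
    """
    fresh = []
    for alert in alerts:
        fp = alert["fingerprint"]
        if fp not in seen_fingerprints:
            seen_fingerprints.add(fp)
            fresh.append(alert)
    return fresh

incoming = [
    {"fingerprint": "abc123", "labels": {"service": "api-gateway"}},
    {"fingerprint": "abc123", "labels": {"service": "api-gateway"}},  # duplicate
    {"fingerprint": "def456", "labels": {"service": "billing"}},
]
seen = set()
unique = dedupe_alerts(incoming, seen)
```

Deduplicating at the hub means a flapping alert generates one incident, not twenty.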
Step 2: Create Triggers Based on Health Status
Once alerts flow into your incident hub, you can configure workflows to listen for specific conditions. For example, you can set up a trigger in Rootly that watches for an alert where an ArgoCD application's health has changed to Degraded [6].
When this condition is met, the platform can automatically declare an incident, create a dedicated Slack channel, and start the response process. The trigger logic is simple: `IF alert.payload contains 'health.status: "Degraded"' AND alert.source is "ArgoCD" THEN create_incident`.
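That IF/THEN logic can be sketched in a few lines of Python. The payload shape and function name below are assumptions for illustration, not a real Rootly or ArgoCD API:

```python
def should_create_incident(alert):
    """Return True when an ArgoCD alert reports a Degraded health status."""
    payload = alert.get("payload", {})
    health = payload.get("health", {})
    return alert.get("source") == "ArgoCD" and health.get("status") == "Degraded"

degraded_alert = {
    "source": "ArgoCD",
    "payload": {"application": "checkout", "health": {"status": "Degraded"}},
}
```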
Step 3: Route Notifications to the Right Team, Instantly
This is the most critical step: getting the right eyes on the problem. The workflow uses routing rules to identify the correct on-call engineer based on service ownership defined in your service catalog.
The system then looks up the on-call schedule and pages the engineer directly via their preferred channels—be it Slack, Microsoft Teams, or a phone call. This targeted approach ensures the right person is notified within seconds, bypassing the noise of a general channel. At the same time, stakeholders can be kept informed through automated status page updates and receive instant updates on potential SLO breaches.
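A hedged sketch of that routing lookup, assuming a simple in-memory service catalog and on-call schedule; real platforms resolve these from their own catalog and scheduling APIs:

```python
SERVICE_CATALOG = {
    # service name -> owning team (illustrative data)
    "api-gateway": "platform-team",
    "billing": "payments-team",
}

ON_CALL = {
    # team -> engineer currently on call (illustrative data)
    "platform-team": "alice@example.com",
    "payments-team": "bob@example.com",
}

def route_alert(alert, default_contact="sre@example.com"):
    """Resolve the on-call engineer for the alert's service label,
    falling back to a catch-all contact for unowned services."""
    service = alert.get("labels", {}).get("service")
    team = SERVICE_CATALOG.get(service)
    return ON_CALL.get(team, default_contact)

page_target = route_alert({"labels": {"service": "api-gateway"}})
```

The fallback contact is the important design choice: an alert for an unowned service should still page someone, never disappear.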
Next Level: Real-Time Remediation Workflows for Kubernetes Faults
Once you master automated notifications, the next logical step is to automate the fix itself. Building real-time remediation workflows for Kubernetes faults can further reduce MTTR and free up your engineers to focus on more complex problem-solving.
Here are a few examples of automated remediation actions you can trigger from an alert:
- Automated Rollback: If a deployment via ArgoCD results in a `Degraded` health status, a workflow can automatically trigger a rollback to the last known-good configuration [1].
- Pod Restart: For a `CrashLoopBackOff` alert, a workflow can run a `kubectl rollout restart` command for the failing deployment, which often resolves transient issues without human intervention.
- Graceful Degradation: When a service starts failing, an automated workflow can interact with a service mesh like Istio to apply a circuit breaker, preventing cascading failures while the team investigates [5].
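A minimal sketch of how a workflow engine might map an alert to one of these remediation actions. The alert names, label keys, and returned command strings are illustrative assumptions; a real workflow would execute them through the platform's runner rather than a raw shell:

```python
def remediation_command(alert):
    """Map a Kubernetes alert to a candidate remediation command string,
    or None when no safe automated action exists."""
    name = alert.get("name", "")
    labels = alert.get("labels", {})
    if name == "PodCrashLooping":
        deploy = labels.get("deployment", "unknown")
        ns = labels.get("namespace", "default")
        # Restarting the deployment often clears transient crash loops.
        return f"kubectl rollout restart deployment/{deploy} -n {ns}"
    if name == "ArgoAppDegraded":
        app = labels.get("application", "unknown")
        # Roll back to the last known-good ArgoCD revision.
        return f"argocd app rollback {app}"
    return None  # escalate to a human instead of guessing

cmd = remediation_command({
    "name": "PodCrashLooping",
    "labels": {"deployment": "api-gateway", "namespace": "prod"},
})
```

Returning `None` for unrecognized alerts is deliberate: automated remediation should be an allowlist, not a default.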
These automated actions are part of a broader strategy to slash outage time and build more resilient systems.
Build a Proactive Incident Response Process
Transitioning from manual alerts to a system that automatically notifies platform teams about degraded clusters is a transformative step for any engineering organization. It turns incident response from a reactive, high-stress scramble into a proactive and controlled process. The results are clear: faster resolution times, higher system reliability, and engineers who can focus on building value instead of fighting fires.
Rootly provides the incident management hub and workflow engine to build these powerful automations. By connecting your monitoring tools to Rootly, you can start auto-notifying teams and even trigger automated remediation in minutes.
Ready to stop the scramble and build proactive workflows? Book a demo to see how Rootly connects monitoring, alerting, and remediation in one place.
Citations
1. https://oneuptime.com/blog/post/2026-02-26-argocd-automatic-rollback-health-degradation/view
2. https://oneuptime.com/blog/post/2026-02-26-argocd-monitor-degraded-resources/view
3. https://www.groundcover.com/kubernetes-monitoring/kubernetes-alerting
4. https://drdroid.io/engineering-tools/guide-for-kubernetes-alerting-best-practices-for-setting-alerts-in-kubernetes
5. https://oneuptime.com/blog/post/2026-02-24-how-to-handle-graceful-service-degradation-with-istio/view
6. https://oneuptime.com/blog/post/2026-02-26-argocd-notification-triggers-health-status/view
7. https://techcommunity.microsoft.com/blog/appsonazureblog/proactive-health-monitoring-and-auto-communication-now-available-for-azure-conta/4501378