
June 12, 2025

5 mins

Your reliability is only as resilient as the platforms you build on

The tools you depend on can't be single points of failure

Written by JJ Tang

Today, GCP, a major cloud provider, experienced widespread intermittent outages. The disruptions rippled across the internet, degrading availability, latency, and uptime for at least 13,000 companies, including Shopify and OpenAI, according to their status pages. But what stood out most wasn't just that services went down; it was that some of the very tools meant to help teams respond to those outages also went dark.

It’s easy to treat cloud providers as “infrastructure solved,” but incidents like today’s expose just how quickly failures can cascade, especially when your incident tooling sits downstream of the same providers you’re trying to monitor or manage. When customers needed their incident tooling the most, to declare incidents, notify responders, escalate, communicate, and stay coordinated, that tooling was either unavailable or severely degraded. And to be clear: this isn’t about dunking on anyone. Infra is hard. Outages happen. We’ve been there.

The Tools You Depend On Can't Be Single Points of Failure

These events are reminders that resiliency isn't just about fast RTOs or replicated databases. It's about your system design assumptions. And that includes the tooling your teams depend on during high-severity incidents.

When your incident response platform is deployed on a single cloud provider, and that provider goes down, it drags your response capabilities down with it. At that point, you’re not just flying blind—you’re unable to even organize a response crew.

The analogy here is building a fire alarm that melts in the fire.

Software Is the New Front Door

Today, software is the customer experience. Whether you’re a bank, a SaaS company, or a pizza chain—your software reliability is your brand. Downtime isn’t just a technical issue; it’s a broken promise.

And increasingly, the difference between a company that recovers well and one that spirals is the quality of its incident response. That response depends on tooling, automation, orchestration, and communication systems that must be more resilient than the systems they’re supporting.

Multi-Cloud Isn’t Easy, But It’s Necessary

Rootly made the architectural decision early on to be multi-cloud, multi-region, and designed with fault domains in mind. Our control plane is isolated. We assume our own dependencies can and will fail, and we actively design around that reality. Our team includes SREs who’ve lived through some of the most complex outages at places like Google, LinkedIn, and Instacart. In general, we are in the business of doing the boring things right.

We know the pain of having the right alert but no way to act on it.

Outages like today’s aren’t rare anymore. In the past 12 months alone:

  • Cloudflare experienced a global routing outage due to BGP misconfigurations.
  • Azure had identity service disruptions that cascaded into Microsoft Teams and Outlook downtime.
  • GitHub Actions went down for several hours, blocking deploy pipelines across thousands of engineering teams.

These events aren’t edge cases. They’re part of the environment we all operate in. And the lesson is clear: if your incident tooling is downstream of the same infrastructure you’re trying to manage, then you’re building on sand.

It’s Time To Raise the Bar

We shouldn’t accept that incident response tools are themselves brittle. If we’re going to take reliability seriously—and we must, because customers expect zero downtime—then we need to hold the systems that support reliability to a higher standard.

Your incident tooling should be:

  • Cloud-agnostic
  • Built with control planes that assume regional failures (see the sketch after this list)
  • Designed with redundancy and fault-isolation baked in
  • Transparent about where it runs and how it stays online
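To make “assume regional failures” concrete, here is a minimal, purely illustrative sketch of client-side failover across independently hosted endpoints. The URLs, the payload, and the declare_incident helper are hypothetical and not Rootly’s actual API; the point is simply that nothing on the path to declaring an incident should depend on a single provider or region.

    # Minimal sketch (hypothetical endpoints and payload): try each
    # independently hosted API endpoint in turn so that a single provider
    # or region outage does not block declaring an incident.
    import json
    import urllib.error
    import urllib.request

    ENDPOINTS = [
        "https://api.primary.example.com/v1/incidents",   # provider / region A
        "https://api.failover.example.net/v1/incidents",  # provider / region B
    ]

    def declare_incident(title: str, severity: str, timeout: float = 3.0) -> dict:
        """Return the first successful response; fall through to the next endpoint on failure."""
        payload = json.dumps({"title": title, "severity": severity}).encode()
        last_error = None
        for url in ENDPOINTS:
            request = urllib.request.Request(
                url, data=payload, headers={"Content-Type": "application/json"}
            )
            try:
                with urllib.request.urlopen(request, timeout=timeout) as resp:
                    return json.load(resp)
            except (urllib.error.URLError, TimeoutError) as err:
                last_error = err  # this endpoint is unreachable; try the next one
        raise RuntimeError(f"all endpoints unreachable: {last_error}")

The same thinking applies on the platform side: redundant control planes in separate fault domains, so the declare-and-page path keeps working even when one cloud is having a bad day.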

At Rootly, we obsess over this because we know how high the stakes are. When your incident management platform stays online, even when everything else is not, your team gets to do what matters: communicate, coordinate, and recover.

“Just wanted to send some kudos for being able to rely on Rootly while the rest of the internet seems to be falling apart.”

SRE Leader at a large financial company

Because the only thing worse than an outage is not being able to respond to it.

AI-Powered On-Call and Incident Response

Get more features at half the cost of legacy tools.

Book a demo