
June 12, 2025

5 mins

Your reliability is only as resilient as the platforms you build on

The tools you depend on can't be single points of failure

Written by JJ Tang

Today, GCP, a major cloud provider, experienced widespread intermittent outages. The disruptions rippled across the internet, degrading availability, latency, and uptime for at least 13,000 companies, including Shopify and OpenAI, according to their status pages. But what stood out most wasn't just that services went down; it was that some of the very tools meant to help teams respond to those outages also went dark.

It’s easy to treat cloud providers as “infrastructure solved,” but incidents like today’s expose just how quickly failures can cascade, especially when your incident tooling sits downstream of the same providers you’re trying to monitor or manage. When customers needed their incident tooling the most, to declare incidents, notify responders, escalate, communicate, and stay coordinated, that tooling was either unavailable or severely degraded. And to be clear: this isn’t about dunking on anyone. Infra is hard. Outages happen. We’ve been there.

The Tools You Depend On Can't Be Single Points of Failure

These events are reminders that resiliency isn't just about fast RTOs or replicated databases. It's about your system design assumptions. And that includes the tooling your teams depend on during high-severity incidents.

When your incident response platform is deployed on a single cloud provider, and that provider goes down, it drags your response capabilities down with it. At that point, you’re not just flying blind—you’re unable to even organize a response crew.

The analogy here is building a fire alarm that melts in the fire.

Software Is the New Front Door

Today, software is the customer experience. Whether you’re a bank, a SaaS company, or a pizza chain—your software reliability is your brand. Downtime isn’t just a technical issue; it’s a broken promise.

And increasingly, the difference between a company that recovers well and one that spirals is the quality of its incident response. That response depends on tooling, automation, orchestration, and communication systems that must be more resilient than the systems they’re supporting.

Multi-Cloud Isn’t Easy, But It’s Necessary

Rootly made the architectural decision early on to be multi-cloud, multi-region, and designed with fault domains in mind. Our control plane is isolated. We assume our own dependencies can and will fail, and we actively design around that reality. Our team includes SREs who’ve lived through some of the most complex outages at places like Google, LinkedIn, and Instacart. In general, we are in the business of doing the boring things right.

We know the pain of having the right alert but no way to act on it.

Outages like today’s aren’t rare anymore. In the past 12 months alone:

  • Cloudflare experienced a global routing outage due to BGP misconfigurations.
  • Azure had identity service disruptions that cascaded into Microsoft Teams and Outlook downtime.
  • GitHub Actions went down for several hours, blocking deploy pipelines across thousands of engineering teams.

These events aren’t edge cases. They’re part of the environment we all operate in. And the lesson is clear: if your incident tooling is downstream of the same infrastructure you’re trying to manage, then you’re building on sand.

It’s Time To Raise the Bar

We shouldn’t accept that incident response tools are themselves brittle. If we’re going to take reliability seriously—and we must, because customers expect zero downtime—then we need to hold the systems that support reliability to a higher standard.

Your incident tooling should be:

  • Cloud-agnostic
  • Built with control planes that assume regional failures (see the sketch after this list)
  • Designed with redundancy and fault-isolation baked in
  • Transparent about where it runs and how it stays online
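To make “assume regional failures” concrete, here is a minimal, purely illustrative sketch of client-side failover across independently hosted endpoints. The URLs, the payload, and the declare_incident helper are hypothetical and not Rootly’s actual API; the point is simply that nothing on the path to declaring an incident should depend on a single provider or region.

    # Minimal sketch (hypothetical endpoints and payload): try each
    # independently hosted API endpoint in turn so that a single provider
    # or region outage does not block declaring an incident.
    import json
    import urllib.error
    import urllib.request

    ENDPOINTS = [
        "https://api.primary.example.com/v1/incidents",   # provider / region A
        "https://api.failover.example.net/v1/incidents",  # provider / region B
    ]

    def declare_incident(title: str, severity: str, timeout: float = 3.0) -> dict:
        """Return the first successful response; fall through to the next endpoint on failure."""
        payload = json.dumps({"title": title, "severity": severity}).encode()
        last_error = None
        for url in ENDPOINTS:
            request = urllib.request.Request(
                url, data=payload, headers={"Content-Type": "application/json"}
            )
            try:
                with urllib.request.urlopen(request, timeout=timeout) as resp:
                    return json.load(resp)
            except (urllib.error.URLError, TimeoutError) as err:
                last_error = err  # this endpoint is unreachable; try the next one
        raise RuntimeError(f"all endpoints unreachable: {last_error}")

The same thinking applies on the platform side: redundant control planes in separate fault domains, so the declare-and-page path keeps working even when one cloud is having a bad day.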

At Rootly, we obsess over this because we know how high the stakes are. When your incident management platform stays online, even when everything else is not, your team gets to do what matters: communicate, coordinate, and recover.

“Just wanted to send some kudos for being able to rely on Rootly while the rest of the internet seems to be falling apart.”

SRE Leader at a large financial company

Because the only thing worse than an outage is not being able to respond to it.

AI-Powered On-Call and Incident Response

Get more features at half the cost of legacy tools.

Book a demo