

June 12, 2025
5 mins
The tools you depend on can't be single points of failure
Today, GCP, a major cloud provider, experienced significant intermittent outages. The disruptions rippled across the internet, affecting availability, latency, and uptime, and showed up on the status pages of at least 13,000 companies, including Shopify and OpenAI. What stood out most wasn’t just that services went down; it was that some of the very tools meant to help teams respond to those outages also went dark.
It’s easy to treat cloud providers as “infrastructure solved,” but incidents like today’s expose just how quickly cascading failures can happen, especially when your incident tooling is downstream of the same providers you’re trying to monitor or manage. When customers needed their incident tooling the most (to declare incidents, notify responders, escalate, communicate, and stay coordinated), it was either unavailable or severely degraded. And to be clear: this isn’t about dunking on anyone. Infra is hard. Outages happen. We’ve been there.
These events are reminders that resiliency isn't just about short recovery time objectives (RTOs) or replicated databases. It's about the assumptions baked into your system design, and that includes the tooling your teams depend on during high-severity incidents.
When your incident response platform is deployed on a single cloud provider, and that provider goes down, it drags your response capabilities down with it. At that point, you’re not just flying blind—you’re unable to even organize a response crew.
The analogy here is building a fire alarm that melts in the fire.
Today, software is the customer experience. Whether you’re a bank, a SaaS company, or a pizza chain—your software reliability is your brand. Downtime isn’t just a technical issue; it’s a broken promise.
And increasingly, the difference between a company that recovers well and one that spirals is the quality of its incident response. That response depends on tooling, automation, orchestration, and communication systems that must be more resilient than the systems they’re supporting.
Rootly made the architectural decision early on to be multi-cloud, multi-region, and designed with fault domains in mind. Our control plane is isolated. We assume our own dependencies can and will fail, and we actively design around that reality. Our team includes SREs who’ve lived through some of the most complex outages at places like Google, LinkedIn, and Instacart. In general, we are in the business of doing the boring things right.
We know the pain of having the right alert but no way to act on it.
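To make the idea of designing around failing dependencies concrete, here is a minimal Python sketch of ordered failover across fault domains. It is an illustration only, not Rootly's implementation: the `Channel` type, the `notify_with_failover` function, and the fault-domain labels are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Channel:
    """A notification path hosted in one fault domain (hypothetical model)."""
    name: str
    fault_domain: str            # illustrative label, e.g. "provider-a/region-1"
    send: Callable[[str], None]  # expected to raise an exception on failure


def notify_with_failover(channels: Sequence[Channel], message: str) -> str:
    """Try each channel in order and return the name of the first that succeeds."""
    errors = []
    for channel in channels:
        try:
            channel.send(message)
            return channel.name
        except Exception as exc:  # any failure means: move on to the next domain
            errors.append(f"{channel.name} ({channel.fault_domain}): {exc}")
    # Every fault domain failed; surface all the reasons at once.
    raise RuntimeError("all channels failed: " + "; ".join(errors))
```

The sketch only helps if the channels really do live in different fault domains. Listing two channels that sit on the same provider gives the appearance of redundancy without the substance.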
Outages like today’s aren’t rare anymore. In the past 12 months alone:
These events aren’t edge cases. They’re part of the environment we all operate in. And the lesson is clear: if your incident tooling is downstream of the same infrastructure you’re trying to manage, then you’re building on sand.
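One rough way to check for that kind of shared fate is to list which providers each incident tool runs on and compare that against the providers your production systems use. The Python sketch below assumes you can produce such an inventory by hand; the function name, tool names, and provider labels are hypothetical, and "every provider is shared with production" is a deliberately simple definition of risk.

```python
from __future__ import annotations


def tools_sharing_fate(tooling: dict[str, set[str]], production: set[str]) -> list[str]:
    """Return tools whose every provider is also a production provider.

    Such a tool has no independent place to run if production's
    providers fail. Tools with no recorded providers are skipped.
    """
    return sorted(
        name
        for name, providers in tooling.items()
        if providers and providers <= production
    )


# Hypothetical inventory: tool name -> providers it is deployed on.
tooling = {
    "pager": {"provider-a"},
    "status-page": {"provider-a", "provider-b"},
    "chat-bot": {"provider-c"},
}
print(tools_sharing_fate(tooling, production={"provider-a"}))  # ['pager']
```

A real audit would also need to follow transitive dependencies (DNS, identity, third-party APIs), which this sketch does not attempt.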
We shouldn’t accept that incident response tools are themselves brittle. If we’re going to take reliability seriously—and we must, because customers expect zero downtime—then we need to hold the systems that support reliability to a higher standard.
Your incident tooling should be:
At Rootly, we obsess over this because we know how high the stakes are. When your incident management platform stays online, even when everything else is not, your team gets to do what matters: communicate, coordinate, and recover.
“Just wanted to send some kudos for being able to rely on Rootly while the rest of the internet seems to be falling apart.”
SRE Leader at a large financial company
Because the only thing worse than an outage is not being able to respond to it.
Get more features at half the cost of legacy tools.