January 20, 2026

Rootly is leading reliability on two fronts

Why I joined Rootly.

Written by Adam Frank

My father suffered from migraines for years while being on-call. I watched what the constant pressure of being never fully off, always waiting, does to someone over time. It’s not just stress. It becomes physical.

Early in my career, I followed in his footsteps. I rotated on-call once every six weeks or so.

Then I became the person who “just knew.” So even when I wasn’t on-call, people called anyway.

One snowy December afternoon, my boss asked me to take a walk with him down the hallway. He wasn’t the talkative type, so even that felt unusual.

Halfway down, he handed me a piece of paper. It said I was doing great and it listed my bonus along with a pay raise.

I read it, looked up, and he just said, “You good with that?”

I smiled, said yes, and went back to work. It felt good. It was good.

But in hindsight, it wasn’t really a compliment. It was a confirmation of the role I’d quietly stepped into: the person the system relied on to make things okay, whether it was my turn or not.

That role is a symptom: we’ve built systems where reliability depends on a few humans compensating for a model that doesn’t learn fast enough.

The loop the industry can’t escape

I’ve spent most of my career in incident response and on-call operations, using products, supporting products, building products, and marketing products.

And for most of that time, the industry has been stuck in the same loop:

1. Something breaks

2. An alert fires

3. Engineers scramble

4. The immediate issue is resolved

5. A retrospective gets written

6. Everyone moves on

7. And then it happens again

We’ve improved the tooling around that loop: better observability, more integrations, cleaner UIs, smarter routing.

But the model hasn’t changed.

We still wait for things to break before we respond.

We still treat reliability as reactive.

We still burn out the same people.

Meanwhile, systems got exponentially more complex: more services, more dependencies, more third parties, more asynchronous failure modes, more surface area for failure.

Where reliability is heading

Here’s what I believe is coming:

The future of reliability isn’t better dashboards. It isn’t another integration. It isn’t AI guessing a root cause after the fact. It’s a shift from reactive → preventative → predictive.

That means autonomous reliability loops: systems that can detect, diagnose, and resolve issues before they cause impact, while keeping humans in control of the decisions that matter.

Here’s the simplest way I can explain why this is finally possible.

We’ve crossed a threshold where AI doesn’t just talk about code, it helps write it. Coding agents are already shipping real productivity gains.

If AI can help create the changes that go into production, it can also help verify that those changes won’t create tomorrow’s incident. That’s the missing loop: learn from incidents, detect risky patterns early, propose improvements, validate them in simulation, and ship resilience by design, not by retrospective.

I spent time in robotics, where multimodal data powers real feedback loops that enable autonomy in the real world. Seeing that up close changed how I think about software reliability: software needs the same kind of learning loop, one that prevents failures instead of simply documenting them and that enables continuous improvement.

In software, a digital twin becomes the safe environment for that learning loop to validate changes before production feels them.

The autonomous loop (what I mean, concretely)

Imagine a loop like this:

1. Detect risk early – not just “service is down,” but “this pattern looks like the early shape of a potentially impacting issue,” or “this is a definite bug.”

2. Diagnose likely causes – correlate signals across deploys, dependencies, configs, traffic shifts, historical incidents, and code repositories.

3. Propose a change – not “here’s a graph,” but “here’s the specific fix and why it should work.”

4. Validate safely before production – test the change against a digital twin: simulate traffic, latency, saturation, errors, and dependency failures, and prove the fix reduces risk before it ships.

5. Apply with oversight – a human reviews and approves the critical steps. The system executes the repetitive, error-prone work.

6. Learn systemically – the loop doesn’t just close the risk. It identifies patterns, architectural weaknesses, and recurring failure modes, and helps prevent them from recurring.

Shipping resilience before the failure shows up.
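
To make the six steps above a little more tangible, here is a minimal Python sketch of how such a loop could be wired together. It is purely illustrative: every name in it (`detect`, `diagnose`, `propose`, `run_loop`, the `simulate` and `approve` callbacks, the sample metrics, and `deploy-A`) is a hypothetical placeholder I made up for this post, not Rootly's API or anyone's production design. The digital twin is stood in for by a `simulate` function that returns a predicted metric value, and the human reviewer by an `approve` function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Risk:
    service: str
    signal: str
    value: float
    threshold: float


@dataclass
class Proposal:
    risk: Risk
    likely_cause: str
    change: str


@dataclass
class LoopResult:
    proposal: Proposal
    validated: bool
    applied: bool


def detect(metrics: Dict[Tuple[str, str], float],
           thresholds: Dict[str, float],
           warn_ratio: float = 0.8) -> List[Risk]:
    """Step 1: flag signals nearing their limit, before an outage-level alert."""
    risks = []
    for (service, signal), value in sorted(metrics.items()):
        limit = thresholds.get(signal)
        if limit is not None and value >= warn_ratio * limit:
            risks.append(Risk(service, signal, value, limit))
    return risks


def diagnose(risk: Risk, recent_changes: Dict[str, str]) -> str:
    """Step 2: naive correlation, blame the latest change to that service."""
    return recent_changes.get(risk.service, "unknown")


def propose(risk: Risk, cause: str) -> Proposal:
    """Step 3: turn the diagnosis into a specific, reviewable change."""
    if cause != "unknown":
        change = f"roll back {cause}"
    else:
        change = f"add capacity to {risk.service}"
    return Proposal(risk, cause, change)


def run_loop(metrics, thresholds, recent_changes,
             simulate: Callable[[Proposal], float],
             approve: Callable[[Proposal], bool],
             history: List[LoopResult]) -> List[LoopResult]:
    results = []
    for risk in detect(metrics, thresholds):
        proposal = propose(risk, diagnose(risk, recent_changes))
        # Step 4: the stand-in twin must predict an improvement.
        validated = simulate(proposal) < risk.value
        # Step 5: a human approves; unvalidated changes are never offered.
        applied = validated and approve(proposal)
        result = LoopResult(proposal, validated, applied)
        history.append(result)  # Step 6: keep every outcome to learn from.
        results.append(result)
    return results


# Hypothetical sample data for illustration only.
metrics = {("checkout", "p99_latency_ms"): 450.0,
           ("search", "p99_latency_ms"): 120.0}
thresholds = {"p99_latency_ms": 500.0}
recent_changes = {"checkout": "deploy-A"}
history: List[LoopResult] = []

results = run_loop(metrics, thresholds, recent_changes,
                   simulate=lambda p: 210.0,  # twin predicts lower latency
                   approve=lambda p: True,    # reviewer says yes
                   history=history)
for r in results:
    print(r.proposal.risk.service, "->", r.proposal.change, "applied:", r.applied)
```

The design choice that matters here is the order of the gates: nothing reaches the reviewer unless the simulation predicts an improvement, and nothing is applied unless the reviewer agrees. A real system would replace each function with something far richer, but the shape of the loop stays the same.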

This isn’t science fiction. The pieces exist today:

  • AI that can reason and generate actionable proposals
  • Simulation that can validate changes in controlled environments
  • Autonomous agents that can execute workflows reliably with guardrails

What’s been missing is the platform to bring it together, built for autonomy from day one, not bolted onto reactive tooling.

Why Rootly

When I looked at who was actually building toward this future, Rootly stood out.

Not because of a feature list. Because of the direction.

Rootly is AI-native from the ground up, automating the incident lifecycle today, while building the platform required for what comes next: preventative and eventually predictive reliability.

That matters, because the autonomous loop depends on capturing the right data, structuring the workflow, and making the system learn. You don’t get there by sprinkling AI on top of pager noise.

The business reality mattered too

Strong fundamentals matter.

Rootly has raised $15.2M. In a market where competitors have raised dramatically more, capital discipline isn’t a buzzword; it changes what a company can prioritize. Capital efficiency creates room to build the best product without chasing growth to justify inflated expectations.

I’ve lived the alternative: hard work, long hours, and outcomes that don’t reward the effort because the math never made sense.

Rootly’s approach makes the outcome feel realistic.

And they’re based in Toronto, where I live again. After years of working remotely, that matters more than I expected.

What I believe

Incident response is at an inflection point.

We’re moving from reactive to proactive, with clear lines of sight to predictive. The destination is autonomy with oversight: systems that learn, validate change safely, and help prevent issues.

On-call is personal to me. It always has been. And capital discipline in the world of startups is highly underrated. 

That’s why I joined Rootly.