

Designing for AI with AI
From predictable systems to fluid experiments
July 29, 2025
15 mins
Key capabilities, rollout strategies, and how to start reshaping the way you run prod.
Exactly a year ago, I reported on what were then the “latest” advancements of AI applied to incident response. Meta was experimenting with machine learning to find root causes, while Google was investing in AI to write post-mortems. At the time, it felt like a marvellous achievement, something only teams at that scale could pull off.
Fast forward to 2025: AI-assisted post-mortems are now a table-stakes feature in incident response tools like Rootly. And root cause analysis? It’s the most sought-after capability of what the industry now calls an AI SRE.
This guide explores how AI is transforming SRE from reactive alerting to proactive incident resolution. I’ll dive into core capabilities, implementation strategies, and future trends. My aim is to show how to get started, build trust, and measure impact.
AI SRE is what happens when you take traditional site reliability engineering and supercharge it with artificial intelligence. Instead of just alerting you when something breaks, these systems monitor, diagnose, and even fix infrastructure issues on their own (when possible & risk-free).
Think of it as moving from a dashboard full of blinking lights to a teammate who not only spots problems but understands your systems well enough to troubleshoot them in real time. Powered by large language models and machine learning, AI SRE tools can interpret logs, read between the metrics, and act based on patterns they’ve seen before.
It’s a shift from reactive firefighting to proactive resilience. And it’s changing how modern teams run production.
Today’s production environments are growing faster, and becoming more fragmented, than traditional SRE approaches can keep up with (at least not without sacrificing the team’s well-being).
Modern systems aren’t just big; they’re sprawling. A single application might run hundreds of microservices across multiple cloud regions. Each one adds another layer of potential failure. Take a typical e-commerce stack: payment gateways, inventory systems, auth layers, recommendation engines, CDNs. All stitched together with APIs and message queues. It’s a miracle anything works, let alone works reliably.
Engineers are drowning in alerts. Too many are noisy or low priority, but they still break flow and demand attention. The result? Constant context switching between building and firefighting. Burnout sets in. The work that really matters, like making systems more resilient in the first place, gets pushed to the side.
When something breaks, the person who knows how to fix it is often… out of office. Critical context lives in scattered docs, or worse, in someone’s head. And when that knowledge isn’t available in the moment, incident timelines stretch, and impact grows.
AI SRE platforms aren’t just smarter alerts. They’re a different way of running production. They tackle the messiness of modern infrastructure with capabilities that go far beyond traditional alerting and incident response.
AI SREs continuously learn from your system by analyzing different data sources. They pull from configs, logs, service maps, past incidents, and even team communications to build a model of how your system works.
For example, an AI SRE might discover that your authentication service has an undocumented dependency on a specific Redis cluster by analyzing API call patterns, even if this relationship isn't explicitly defined in your service mesh configuration. This deep system understanding enables more accurate root cause analysis when issues occur.
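To make that concrete, here’s a minimal sketch of how dependency inference from call patterns could work. Everything here is illustrative: the service names and the `min_calls` threshold are made up, and a real AI SRE would consume trace spans or access logs rather than an in-memory list.

```python
from collections import Counter

def infer_dependencies(api_calls, min_calls=50):
    # Count how often each (caller, callee) pair shows up in the call log.
    edge_counts = Counter((caller, callee) for caller, callee in api_calls)
    # Pairs above the threshold are treated as dependency edges, whether
    # or not they appear in the service mesh config.
    return {edge: n for edge, n in edge_counts.items() if n >= min_calls}

# Hypothetical data: auth-service keeps hitting a Redis cluster that no
# config file mentions.
calls = [("auth-service", "redis-sessions")] * 120 + [("auth-service", "user-db")] * 30
print(infer_dependencies(calls))  # {('auth-service', 'redis-sessions'): 120}
```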
When an alert goes off, AI SRE systems don’t just send a Slack message and wish good luck to whoever acks it. The AI SRE immediately gets to work providing as much helpful context as it can to the responder.
It queries metrics, scans logs, hits health checks, and traces requests, all in parallel. While a human might check one system at a time, an AI SRE can fan out across the whole stack instantly. That’s how you go from “we’re investigating” to “here’s what’s broken” in minutes, not hours, which translates directly into a lower mean time to resolution.
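Here’s a toy illustration of that fan-out using Python’s asyncio. The probe functions and their return strings are stand-ins; in a real system each coroutine would call your metrics store, log backend, and health endpoints.

```python
import asyncio

# Stand-in probes: the sleeps simulate network I/O against real backends.
async def query_metrics(service):
    await asyncio.sleep(0.1)
    return f"{service}: error rate 4.2% (baseline 0.3%)"

async def scan_logs(service):
    await asyncio.sleep(0.1)
    return f"{service}: 312 'connection refused' entries in last 5 min"

async def check_health(service):
    await asyncio.sleep(0.1)
    return f"{service}: /healthz failing on 2 of 6 pods"

async def investigate(service):
    # Fan out across every signal source at once rather than serially.
    return await asyncio.gather(
        query_metrics(service), scan_logs(service), check_health(service)
    )

for finding in asyncio.run(investigate("payment-service")):
    print(finding)
```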
Earlier this year, Google rewrote part of the SRE playbook by showing it could detect dangerous states in a system before they turn into outages. AI SRE platforms aim to do the same: detect patterns and trends that may indicate issues waiting for the right circumstances to become incidents.
For example, if your database connections are trending upward during peak hours, even if they're still “within thresholds,” the AI flags it and suggests a fix before it tips into a real outage. That kind of foresight is what keeps small issues from becoming full-blown incidents.
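One simple way to catch that kind of drift is to fit a slope to recent samples and project when the trend would cross a limit. A rough sketch, with made-up connection counts and a hypothetical pool limit of 100:

```python
def trend_slope(samples):
    """Least-squares slope of evenly spaced samples (e.g. one per minute)."""
    n = len(samples)
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

# DB connections per minute: still under the (hypothetical) limit of 100,
# but climbing steadily toward it.
connections = [62, 64, 67, 71, 74, 78, 81, 85]
slope = trend_slope(connections)
headroom = 100 - connections[-1]
if slope > 0 and headroom / slope < 30:  # breach projected within ~30 min
    print(f"warning: pool on track to exhaust in ~{headroom / slope:.0f} min")
```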
AI SRE systems understand the business context behind technical metrics. They know which services are revenue-critical, understand deployment schedules, and can prioritize issues based on potential business impact rather than just technical severity.
A database slowdown in your analytics pipeline might be technically severe but have minimal immediate business impact, while a slight increase in payment processing latency could directly affect revenue and deserve immediate attention.
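In code, that prioritization might be as simple as weighting technical severity by a per-service business criticality score. The weights and service names below are invented; in practice they’d come from a service catalog that tags each service with its business impact.

```python
# Hypothetical criticality weights, normally sourced from a service catalog.
BUSINESS_WEIGHT = {
    "payment-service": 1.0,     # directly revenue-bearing
    "checkout-api": 0.9,
    "analytics-pipeline": 0.2,  # important, but not time-critical
}

def priority(service, technical_severity):
    """Blend raw technical severity (0-1) with business criticality."""
    weight = BUSINESS_WEIGHT.get(service, 0.5)  # default for untagged services
    return technical_severity * weight

# A 'severe' analytics slowdown scores below a 'minor' payment latency bump.
print(priority("analytics-pipeline", 0.9))  # 0.18
print(priority("payment-service", 0.3))     # 0.30
```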
As GitHub Staff SWE Sean Goedecke says, when you get paged in the middle of the night, you’re far from your peak performance. You’re more of a “tired, confused, and vaguely panicky engineer”. Here’s where an AI SRE can step in and help out during a production incident, one that would normally eat up hours of engineering time.
It’s 2:47 AM on a Tuesday. Your e-commerce platform starts throwing intermittent checkout failures. Monitoring lights up: elevated error rates in the payment service. But the root cause is not immediately clear. The on-call engineer gets paged, while already neck-deep in a separate database issue.
In a conventional setup, the engineer has to drop what they’re doing and start digging. That means scanning payment service logs, checking dependencies, reviewing recent deploys, and maybe looping in other teammates. It’s a time sink, 30 to 60 minutes, easily, while customers keep hitting broken checkouts in another time zone.
The AI SRE system immediately begins parallel investigations across multiple potential causes: recent deploys, configuration changes, dependency health, and traffic patterns.
Within three minutes, the AI SRE identifies that a recent configuration change increased the payment service's database connection timeout, but a separate increase in traffic volume is causing connection pool exhaustion. The system automatically correlates this with a marketing campaign that launched earlier that evening.
The AI SRE bundles all this into a clear narrative for the engineer: here’s what’s happening, here’s the evidence, and here are two paths forward: temporarily increase the connection pool size or implement request throttling. The engineer can approve the recommended action or choose an alternative approach based on business priorities.
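The shape of that hand-off could look something like the sketch below: a structured brief with evidence and reversible options, where nothing executes until a human approves. The dataclasses, field names, and placeholder commands are all illustrative, not any particular product’s API.

```python
from dataclasses import dataclass, field

@dataclass
class RemediationOption:
    description: str
    command: str       # the action the AI proposes to run (placeholder here)
    reversible: bool

@dataclass
class IncidentBrief:
    summary: str
    evidence: list = field(default_factory=list)
    options: list = field(default_factory=list)

brief = IncidentBrief(
    summary=("Checkout failures: a config change raised the payment DB "
             "connection timeout; campaign traffic is exhausting the pool."),
    evidence=[
        "deploy 14:02 UTC changed db_conn_timeout 5s -> 30s",
        "connection pool utilization at 98% since 21:30 UTC",
        "traffic +240%, correlating with the marketing campaign launch",
    ],
    options=[
        RemediationOption("Increase connection pool size 50 -> 100",
                          "kubectl ... (placeholder)", reversible=True),
        RemediationOption("Throttle non-checkout requests at the gateway",
                          "gatewayctl ... (placeholder)", reversible=True),
    ],
)

# Nothing runs until a human picks an option.
chosen = brief.options[0]
print(f"Engineer approved: {chosen.description}")
```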
You don’t flip a switch and go fully autonomous on day one. Successfully rolling out AI SRE takes a thoughtful, staged approach, one that builds trust, avoids surprises, and sets your team up for long-term success in how they approach reliability.
First, put your AI SRE in observation mode. Let it watch incidents and recommend actions. But don’t let it touch anything yet. This gives your team a chance to vet its insights and see how often it gets things right.
Pay attention to how closely its suggestions match what your engineers actually end up doing. If that alignment is high, it’s a strong signal you’re ready to start trusting it with real decisions.
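That alignment check can start embarrassingly simple: log the AI’s top suggestion next to the action the engineer actually took, and compute the match rate. A minimal sketch with invented records:

```python
def alignment_rate(records):
    """Share of incidents where the AI's top suggestion matched what the
    engineer actually did. `records` is a list of (suggested, taken) pairs
    collected during observation mode."""
    matches = sum(1 for suggested, taken in records if suggested == taken)
    return matches / len(records)

history = [
    ("restart pod", "restart pod"),
    ("roll back deploy", "roll back deploy"),
    ("scale up", "increase pool size"),
]
print(f"alignment: {alignment_rate(history):.0%}")  # alignment: 67%
```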
Once the AI proves itself, start small. Let it automate low-risk, easily reversible tasks, like scaling a staging service during a spike. Over time, expand its reach to more complex remediations as confidence grows.
Make sure there are guardrails. Maybe your payment systems require manual approval, but your internal dashboards can run on autopilot. Define boundaries based on risk, not convenience.
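Guardrails like these can start as plain configuration: an explicit allow-list of actions per service, with everything else routed to a human. A hypothetical policy sketch (the service names and action verbs are made up):

```python
# Risk-based boundaries: which actions may run unattended, per service.
POLICY = {
    "payment-service":     {"auto": set(), "needs_approval": {"scale", "restart"}},
    "internal-dashboards": {"auto": {"scale", "restart"}, "needs_approval": set()},
}

def may_auto_execute(service, action):
    # Unknown services default to no automation at all.
    rules = POLICY.get(service, {"auto": set()})
    return action in rules["auto"]

print(may_auto_execute("internal-dashboards", "restart"))  # True
print(may_auto_execute("payment-service", "restart"))      # False -> page a human
```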
Your engineers’ feedback is your main asset. Every time they agree, reject, or tweak an AI SRE decision, that data should go back into the system to make it smarter. You’re not just deploying a tool, you’re training a teammate.
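Capturing that feedback doesn’t require anything fancy on day one; even an append-only log of verdicts gives you training signal later. A sketch, with a made-up incident ID and file path:

```python
import json
import time

def record_feedback(incident_id, suggestion, verdict, note="", path="feedback.jsonl"):
    """Append an engineer's verdict ('accepted' | 'rejected' | 'modified')
    on an AI suggestion. This log later feeds back into the system."""
    entry = {
        "ts": time.time(),
        "incident": incident_id,
        "suggestion": suggestion,
        "verdict": verdict,
        "note": note,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

record_feedback("INC-1042", "increase connection pool 50 -> 100",
                "modified", note="raised to 80, not 100")
```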
Track what matters: detection time, resolution time, false positives, and yes, how your team feels about it. Operational efficiency is important, but so is trust.
AI SRE should plug into your existing workflows, not replace your operational processes. It needs to work with your incident tooling, communication channels, on-call rotations, and runbooks. Think of it as an extension of your team, not a replacement.
AI SREs can be powerful, but they’re not perfect. There are still a few places where human judgment is fundamental.
AI SRE systems may lack complete business context that human engineers possess. An AI SRE might not understand that a particular service degradation is acceptable during maintenance windows or that certain alerts can be ignored during planned load testing.
Modern distributed infra is messy. Sometimes bugs hide in weird service interactions that only show up under just the right conditions. AI is getting better at spotting these patterns, but it’s not there yet.
Automation without oversight is risky. A wrong move in prod can cost real money or damage trust. Critical systems should always have a human in the loop, with clear rollback paths baked in.
An AI SRE needs to integrate with your monitoring, deployment pipelines, incident tooling, all of it. Expect upfront engineering work to get it wired in. It’s worth the effort, but don’t underestimate it.
AI SRE systems aren’t just a trend. They’re reshaping how teams think about infrastructure reliability. And we’re only at the beginning.
Future AI SREs won’t just respond when something breaks; they’ll anticipate it. As machine learning gets sharper, these systems will start spotting the subtle patterns today’s tools miss. And instead of learning in isolation, they could learn from a much broader pool of incident data, across industries, not just inside a single org.
Proactive System Optimization
Rather than just responding to issues, AI SRE systems will continuously optimize infrastructure performance, automatically tuning configurations, scaling resources, and implementing architectural improvements based on observed patterns.
Cross-Organization Knowledge Sharing
AI SRE platforms may eventually share anonymized incident patterns and solutions across organizations, creating a collective intelligence that benefits the entire industry’s reliability efforts.
Integration with Development Workflows
Future systems will likely extend beyond operations into development processes, providing reliability feedback during code reviews, suggesting architectural improvements, and automatically implementing reliability best practices.
If you’re thinking about bringing AI SRE into your org, don’t treat it like a tool drop. It’s a shift in how your team works. And like any good shift, it pays to roll it out deliberately.
Look at where your pain lives. Repetitive investigations? Noisy alerts? Fragile dependencies? That’s your starting point. The best ROI comes from tackling the work your team’s already tired of doing.
Don’t throw it straight into prod. Pick a non-critical system to test-drive the AI. Let your team learn how it thinks, what it misses, and where it shines, without putting core business workflows at risk.
As trust builds, widen the scope. Add more systems, allow more automation but keep listening. Watch how the AI performs and how your team feels about it. Feedback is fuel here.
Engineers need to know what to expect. Teach them when to trust AI suggestions, when to question them, and how to feed their experience back into the loop. This isn’t about replacing human intuition, it’s about leveling it up.
Success metrics for AI SRE implementation should focus on both technical outcomes and team productivity improvements.
Technical Metrics
Detection time, resolution time, and false-positive rate, the same signals worth watching during rollout. A minimal sketch of computing the first two follows below.
Productivity Metrics
On-call interruptions, context switching between building and firefighting, and how your team feels about working alongside the AI.
Business Impact Metrics
Customer-facing downtime and the revenue impact of incidents on business-critical services.
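The sketch below computes mean time to detection and mean time to resolution from hypothetical incident timestamps; a real pipeline would pull these from your incident tooling rather than a hardcoded list.

```python
from datetime import datetime

# (started, detected, resolved) timestamps; all values are invented.
incidents = [
    ("2025-07-01 02:47", "2025-07-01 02:50", "2025-07-01 03:15"),
    ("2025-07-03 14:10", "2025-07-03 14:12", "2025-07-03 14:40"),
]

def minutes(a, b):
    fmt = "%Y-%m-%d %H:%M"
    return (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).total_seconds() / 60

mttd = sum(minutes(started, detected) for started, detected, _ in incidents) / len(incidents)
mttr = sum(minutes(started, resolved) for started, _, resolved in incidents) / len(incidents)
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")  # MTTD: 2.5 min, MTTR: 29.0 min
```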
The AI SRE is a shift in how we think about running production. By combining the pattern-recognition power of AI with the hard-earned practices of site reliability engineering, we get systems that don’t just alert on problems but can help resolve them.
Sure, the tech is still maturing. There are limitations, edge cases, and plenty of integration work ahead. But the teams that start experimenting now will be ahead of the curve when more advanced capabilities land.
Success with AI SRE doesn’t come from flipping a switch. It takes thoughtful rollout, tight integration with your workflows, feedback loops that keep improving the system, and a team that understands how to work with their new AI capabilities.
The future of reliability is intelligent, proactive, and collaborative. And the sooner you start that journey, the sooner your team can spend less time firefighting, and more time shipping great things.
Curious where to begin? Start by mapping out your biggest operational headaches and look for high-impact places where automation can lend a hand. That’s where AI SRE can start earning its place on the team.