
November 18, 2025

6 minutes

AI didn’t “arrive” at KubeCon 2025. It took the Pager.

5 takeaways from Atlanta on AI, Kubernetes, and reliability

Written by Jorge Lainfiesta

KubeCon + CloudNativeCon North America 2025 in Atlanta felt different.

Last year, everyone was demoing “AI assistants for your Kubernetes cluster” – lots of chat, not a lot of actually doing anything. This year, AI wasn’t just answering questions. It was taking actions:

  • Scaling workloads.
  • Rolling back bad deploys.
  • Moving things off flaky nodes.
  • Opening incidents and updating status pages.

If you care about reliability, that’s both exciting and mildly terrifying.

At Rootly, we live at the intersection of incidents, automations, and humans trying to get some sleep. Here’s how we’d summarize the real reliability lessons from KubeCon 2025, filtered through an on-call lens.

1. AI is no longer your “helper” – it’s an operator.

Kubernetes already had “operators” as a concept. Now AI wants that job title too.

A lot of demos looked roughly like this flow:

  1. Something looks off (error rate spikes, queue backlog grows).
  2. An AI agent correlates signals across logs, metrics, traces, config, Git, and change history.
  3. It proposes actions:
    • “Roll back deployment checkout-service from v42 → v41”
    • “Drain node k8s-gpu-17 and reschedule pods; suspected hardware issue”
    • “Reduce traffic to canary by 50%; SLO burn rate > threshold”
  4. Depending on policy:
    • It either executes directly, or
    • Opens an incident, posts in Slack, and waits for a human :thumbsup:.

The keyword there is policy.

The new job: design guardrails, not just runbooks.

SREs used to write runbooks like:

“If 5xx > 5% for 5 minutes, try rolling back the last deployment.”

Now the work looks more like:

# ai-change-policy.yaml
service: checkout
allowed_actions:
  - type: rollback
    max_versions_back: 1
    require_approval: true
    approvers:
      - team: checkout-oncall
  - type: scale
    min_replicas: 3
    max_replicas: 50
    require_approval: false
safety_checks:
  - name: must_not_reduce_availability_below
    slo: checkout-availability
    threshold: 99.0
logging:
  audit: true
  incident_tag: "ai-initiated-change"


Instead of telling a human how to fix things line-by-line, you’re telling an AI operator what it’s allowed to touch and under which conditions.

What this means for reliability

  • You’ll see fewer “human fat-fingered prod” incidents and more “agent did something dumb” incidents.
  • The difference between “magic” and “chaos gremlin” is your policy design.
  • Incident timelines now need to include: “At 02:13, AI operator ‘orca-ops’ rolled back checkout from v42 → v41.”

This is exactly where Rootly-style audit trails, incident timelines, and approval flows become non-negotiable. If a robot is going to touch prod, you’d better be able to answer “who did what and why?” in under 10 seconds.

2. Platform engineering is now AI-native and reliability-first.

Once again, KubeCon was full of platform engineering stories. The change this year: those platforms are being rebuilt around AI workloads and reliability, not just “make devs happy.”

The emerging pattern:

  • Internal platform exposes:
    • Golden paths (templates, pipelines, environments).
    • Opinionated configuration for Kubernetes, networking, and storage.
    • Built-in SLOs, alerting, and incident hooks.
  • On top of that:
    • AI agents act as operators and copilots for the platform.
    • Teams ship apps by following paved roads, not reinventing Helm every quarter.

Reliability as a platform feature, not a dashboard.

Historically, SLOs lived in Prometheus/Grafana and incidents lived in a random Notion folder. Now teams are wiring them directly into the platform API.

For example, instead of:

“Oh, we should probably define SLOs for this service someday.”

You get:

# service.yaml
name: checkout
owner: team-checkout
runtime: "kubernetes"
slo:
  availability:
    objective: 99.9
    window: "30d"
    alert_policies:
      - type: burn_rate
        fast_burn: "14x over 1h"
        slow_burn: "6x over 6h"
incident:
  channel: "#incidents-checkout"
  runbook: "https://runbooks.internal/checkout"
  severity_mapping:
    - condition: "slo_breach"
      severity: "SEV-2"

Now when you spin up a service, you must define:

  • An owner
  • SLOs
  • Where incidents go
  • Which runbook to use

No SLO? No owner? No deploy.
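
How you enforce that is up to you. As a rough sketch – in a made-up gate format rather than any particular tool’s syntax (the same idea works as an OPA/Kyverno policy or a CI check) – it could be as simple as:

# platform-gates.yaml – hypothetical format, for illustration only
gates:
  - name: require-reliability-metadata
    applies_to: "service.yaml"
    required_fields:
      - owner
      - slo.availability.objective
      - incident.channel
      - incident.runbook
    on_missing: block_deploy
    message: "No SLO? No owner? No deploy."

The point is that the check is automated and blocking, not a wiki page someone is supposed to remember.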

What this means for reliability

  • Reliability becomes a product of the platform, not heroics by a single SRE.
  • AI operators and incident tools (like Rootly) can call standardized APIs instead of dealing with snowflake services.
  • Multi-tenant teams get consistency: same severity levels, same workflow, same automation patterns.

3. Observability is turning into an AI decision engine, not a dashboard showroom.

“Chat with your logs” is cute. “Let AI drive your incident response using everything observability knows” is useful.

The more serious KubeCon stories were about:

  • Correlating “what changed?” with “what broke?” and, more importantly, “why?”
  • Building causal graphs from traces, metrics, and events.
  • Having AI suggest:
    • Probable root cause
    • Impacted services / customers
    • The next two things you should try

Imagine this in a real incident:

AI: “Error rate for checkout increased 8x starting 02:07.
Correlated change: new config applied to rate-limiter at 02:06.
Similar incident: SEV-1 on 2025-01-14. Mitigation last time: revert config and flush cache.”

That’s not science fiction. It’s mostly a question of data plumbing and how good your past incident/retrospective data is.

Garbage in, AI-flavored garbage out.

AI can’t magically fix:

  • Unlabeled metrics (custom_metric_17)
  • Trace spans with zero useful attributes
  • Incidents with descriptions like “prod broken, fixed now”

If you want AI to be helpful during an outage, you need:

  • High-signal telemetry
    • Good tags (service, version, region, customer tier)
    • Clear service graphs
  • Structured incident data (a minimal sketch follows this list)
    • Tags (services affected, root cause category, mitigation type)
    • Consistent severity and status fields
    • Retrospectives that actually explain what happened
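
For the structured incident data, here’s a minimal sketch of one well-tagged incident, reusing the rate-limiter story from above. Field names are illustrative, not any specific tool’s schema:

# illustrative structured incident record – field names are hypothetical
incident:
  id: "INC-2025-031"            # illustrative ID
  severity: "SEV-2"
  status: "resolved"
  services_affected: ["checkout", "rate-limiter"]
  detection_source: "slo_burn_alert"
  root_cause_category: "config_change"
  mitigation_type: "config_revert"
  retrospective:
    summary: "Rate-limiter config change at 02:06 drove an 8x error-rate spike on checkout; config reverted and cache flushed."
    action_items:
      - "Add a canary stage for rate-limiter config changes"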

What this means for reliability

  • Observability isn’t the end state; it’s the input for automated decision-making.
  • Investing in better incident hygiene (like structured Rootly incidents + retros) is effectively “training data prep” for future AI-driven incident response.
  • Faster MTTR becomes a combo of:
    • Good data
    • Good AI models
    • Smart automation hooks (restart, rollback, failover) wired into your incident tool – sketched below.
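
For that last point, here’s a tool-agnostic sketch of what such a hook could look like. Trigger names and action types are illustrative, not a specific product’s API:

# hypothetical automation hook – adjust to whatever your incident tooling supports
hook:
  name: checkout-fast-burn-response
  trigger:
    source: slo_monitor
    condition: "burn_rate(checkout-availability, 1h) > 14"
  actions:
    - create_incident:
        severity: "SEV-2"
        channel: "#incidents-checkout"
        tags: ["slo-burn", "ai-assisted"]
    - page:
        team: checkout-oncall
    - propose_action:            # proposed, not executed – a human approves in Slack
        type: rollback
        target: "deployment/checkout-service"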

4. Reliability for AI workloads: GPUs, data, and model endpoints.

A lot of hallway conversations in Atlanta were about a simple pain: “Our GPUs are either idle or on fire, and both are expensive.”

Running AI workloads on Kubernetes at scale introduces new reliability dimensions:

New reliability concerns

  1. GPU capacity and scheduling
    • Under-provision and your inference SLOs die.
    • Over-provision and finance shows up with a flamethrower.
    • Fragmentation and bin-packing issues can cause random queue spikes.
  2. Data pipelines and feature stores
    • A slow or stale feature pipeline is an incident, even if your pods are “healthy.”
    • “Model using stale features for top-tier customers” is absolutely a SEV-1.
  3. Model endpoints
    • SLOs move beyond HTTP 200s:
      • Time-to-first-token / latency
      • Model availability per region
      • Freshness of deployed model version

So instead of just:

slo:  
  availability: 99.9

You’ll start to see:

slo:
  availability: 99.9
  latency_p95_ms: 800
  ttfb_p95_ms: 300
  gpu_queue_time_p95_s: 10
  feature_freshness_max_delay_s: 60

What this means for reliability

  • SREs now own GPU and model SLOs, not just pods and nodes.
  • Some of your highest-value incidents will have root causes like:
    • “Insufficient GPUs in region X”
    • “Feature pipeline lag > 5 minutes for premium customers”
  • Incident tooling needs to understand and surface these AI-specific SLIs, not just HTTP status codes.

If this sounds like overkill, look at how many AI-native companies are already tagging incidents with things like model="gpt-foo" and tier="realtime". That’s where the world is heading.
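
A handful of structured tags is enough to start; for example (tag names beyond model and tier are illustrative):

# illustrative incident tags for an AI-serving path
incident_tags:
  model: "gpt-foo"
  tier: "realtime"
  region: "us-east-1"
  gpu_pool: "a100-inference"
  sli_breached: "ttfb_p95_ms"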

5. Security and governance are now hard prerequisites for reliability.

Giving an AI agent kubectl access is… a choice.

A recurring KubeCon theme: you can’t separate security, governance, and reliability once autonomous agents are in the mix.

Misbehaving agents are a new type of incident.

New failure modes you’ll see:

  • Agent auto-scales something to zero under “cost optimization.”
  • Agent applies a “safe” network policy that quietly blocks a critical dependency.
  • Agent rolls back the wrong deployment because the environment metadata was messy.

Preventing this isn’t just about “better AI.” It’s about:

  • Least privilege for agents (see the RBAC sketch below)
    • Separate identities for each agent.
    • Scoped permissions per service / namespace.
  • Policy-as-code
    • Encode what agents may do in production as declarative policy.
  • Auditability
    • Every AI-initiated action shows up in:
      • Audit logs
      • Incident timelines
      • Retrospectives

Something like:

actor: ai-operator/orca-ops
time: "2025-04-03T02:13:21Z"
action: "rollback"
resource: "deployment/checkout-service"
from_version: "v42"
to_version: "v41"
reason: "slo_burn_rate_exceeded"
incident_id: "INC-2025-042"
approval:
  required: true
  approved_by: "@oncall-sre"
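
On the least-privilege side, plain Kubernetes RBAC gets you most of the way: give each agent its own identity and scope it to a namespace. A minimal sketch, assuming the agent runs as its own ServiceAccount (names and namespace are illustrative):

# Namespace-scoped, read-plus-rollback permissions for a hypothetical agent identity
apiVersion: v1
kind: ServiceAccount
metadata:
  name: orca-ops
  namespace: checkout
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: orca-ops-checkout
  namespace: checkout
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "patch"]   # enough to roll a Deployment back
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["get", "list", "watch"]            # needed to find the previous revision
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]            # read-only on pods: no delete, no exec
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: orca-ops-checkout
  namespace: checkout
subjects:
  - kind: ServiceAccount
    name: orca-ops
    namespace: checkout
roleRef:
  kind: Role
  name: orca-ops-checkout
  apiGroup: rbac.authorization.k8s.io

No cluster-wide permissions, no write access outside the checkout namespace.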

What this means for reliability

  • Some of your worst outages in 2026 won’t be caused by humans or hardware – they’ll be caused by well-intentioned, overpowered agents.
  • The line between “security incident” and “reliability incident” gets blurry when an agent goes rogue.
  • Tools like Rootly need to make it trivial to answer:
    • “Was this change human-initiated or AI-initiated?”
    • “Which policy allowed it?”
    • “How do we prevent it next time?”

So… what do you actually do on Monday?

Here’s how we’d turn the KubeCon noise into a concrete plan.

1. Design your first AI operator policy.

Pick one or two low-risk actions for a single service, for example:

  • Restart pods for a stateless service.
  • Propose (not execute) rollbacks when SLO burn crosses a threshold.

Define:

  • What the agent can do.
  • What requires human approval.
  • How you’ll log and observe those actions.
  • How those actions show up in incidents/retrospectives.

Start narrow. Widen the radius slowly.
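
Concretely, a first policy file could be as small as this – same hypothetical schema as the earlier example, scoped to one low-risk service:

# starter-policy.yaml – hypothetical schema, illustrative service name
service: notifications
allowed_actions:
  - type: restart_pods          # stateless service, low blast radius
    require_approval: false
  - type: rollback
    propose_only: true          # the agent suggests; a human executes
    require_approval: true
    approvers:
      - team: notifications-oncall
safety_checks:
  - name: must_not_reduce_availability_below
    slo: notifications-availability
    threshold: 99.0
logging:
  audit: true
  incident_tag: "ai-initiated-change"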

2. Expose reliability as part of your platform API.

If you maintain a platform:

  • Require new services to define:
    • Owner/team
    • SLOs
    • Incident channel
    • Runbook URL
  • Wire this into your incident tooling (Rootly, paging, status page) via automation, not a checklist doc.

Your future self will thank you when you’re trying to debug a 3am incident and everything has a clear owner and SLO.

3. Treat incident data as training data.

Boring but huge:

  • Normalize severity levels.
  • Always tag incidents with affected services and components.
  • Make retrospectives structured and searchable.
  • Standardize fields like:
    • root_cause_category
    • mitigation_type
    • detection_source

You’re not just writing history; you’re training the next generation of AI responders.

4. Add AI-specific SLIs/SLOs where it matters.

For your AI-heavy paths, define SLIs like:

  • GPU queue time
  • Model endpoint latency/TTFB
  • Feature freshness
  • Model rollout error rate

Wire those into your existing SLO / incident workflow. A model with stale features is just as much a reliability problem as a 500.

5. Put governance in front of your agents.

Before you let an AI touch prod, answer:

  • What can it change?
  • In which environments?
  • Under which conditions?
  • Who approves?
  • How is everything logged and surfaced in incidents?

If you can’t write that down in a simple policy file, you’re not ready to give it real power.

Where Rootly fits into all this

This isn’t a stealth pitch, but it’s the lens we see the world through:

  • AI operators will need a reliable incident backbone to open incidents, notify humans, and document actions.
  • SREs will need clean, structured incident data to train better automation.
  • Leaders will need audit trails when AI agents start making real changes in production.

That’s exactly the space Rootly lives in: turn chaos into structured data and repeatable, automated workflows — whether the actor is a human, a bot, or something that just got announced on the KubeCon keynote stage.