The Unofficial KubeCon NA ‘25 SRE Track

Andre King

November 3, 2025

I’m so hyped for KubeCon NA! Next week, we’ll be in Atlanta reconnecting with friends, partners, and customers, and swapping ideas with leaders and cloud-native experts.

It’s kind of a Rootly tradition now to put together the unofficial SRE track for KubeCon. With over 300 talks crammed into three days, there’s no shortage of great topics to dive into, but only so much time. After a heated lunch debate last week, we narrowed it down to the five must-see SRE-related sessions.

No AI talks this time around. We’re sparing you that, so you can browse fearlessly.

I, personally, tend to only block off one or two talks a day to leave room for exploring trends I haven’t discovered yet and catching sessions that pop up on the fly. Plus, I like having enough time to network and finally meet IRL with folks I only get to see at KubeCon.

Oh, and don’t miss our two Happy Hours:

Mon, Nov 10 - Above the Sky Happy Hour (with Rootly, AuthZed, Checkly, iTmethods, Atolio)
Wed, Nov 12 - Cheers to Observability (with Chronosphere, Rootly, Google Cloud, Thoughtworks, Embrace)

Without further ado: our KubeCon SRE track!

Making Application Rollouts Observable, Actionable and Boring

Vasudev Bongale (LinkedIn) will share how the platform engineering team at LinkedIn manages over 50,000 application rollouts weekly (yes, weekly). At that scale, chaos is the first thing that comes to mind, but they’ve turned deployment into something predictably dull.

They built a system that tracks every change (from pipeline triggers to kubectl edits and controller-driven mutations), surfaces failure modes (like pod scheduling, init container crashes, image-pull delays), and wraps all of this into out-of-the-box health metrics that developers can actually understand.

When: Tuesday, November 11, 2025 | 2:30 pm – 3:00 pm EST

Where: Building B | Level 3 | B304-305

Add Vasudev’s talk to your schedule

Maximizing Launch Reliability: Ensuring Stable Application Lift-Off and Orbit on Kubernetes

If you’ve ever deployed an app that worked perfectly in staging but face-planted at startup in production, this one’s for you. Hiroshi Hayakawa (LY Corporation) digs into what he calls launch reliability, the often overlooked moment when your service spins up, scales, and either soars or stalls.

This talk will cover practical tuning for startup success: smarter resource allocation, health check optimization, automated warm-ups, and Kubernetes’ new in-place pod resize for CPU bursts. The goal isn’t just faster rollouts but ensuring every restart and deploy lifts off cleanly and stays in orbit.

When: Tuesday, November 11, 2025 | 5:45 pm – 6:15 pm EST

Where: Building B | Level 4 | B406b–407

Add Hiroshi’s talk to your schedule

Beyond the Dashboard: Modern Observability for Platform Engineering at Scale

In this panel, Whitney Lee (Datadog), Stevie Caldwell (Fairwinds), Danielle Cook, Khallai Taylor (E.ON Digital Technology GmbH) and Payal Bagga (Intuit) pull back the curtain on what “observability at scale” really means when you’re operating internal developer platforms.

They’ll cover everything from telemetry pipelines and OpenTelemetry adoption, to aligning service-level objectives with business goals, to embedding observability into everyday developer workflows.

When: Wednesday, November 12, 2025 | 2:15 pm – 2:45 pm EST

Where: Building B | Level 3 | exact room TBC

Add this panel to your schedule

Building Resilient Cloud-Native Infrastructure in the Second Decade: TAG Operational Resilience

In this panel by the CNCF Operational Resilience TAG, Rafael Brito (StormForge), Mario Fahlandt (Kubermatic), Saiyam Pathak (vCluster), Alolita Sharma (Apple), and Nabarun Pal (Broadcom) will lay out what resilience means when your infrastructure spans clouds, continents, and compliance regimes.

The group will share how the OpRes TAG is redefining reliability standards across observability, management, sustainability, and Day 2 operations, connecting the dots between business continuity, cost efficiency, and the health of the ecosystem itself.

When: Wednesday, November 12, 2025 | 5:30 pm – 6:00 pm EST

Where: Building B | Level 5 | Thomas Murphy Ballroom 2–3

Add this panel to your schedule

But What About Reliability? The Multi-Million Dollar Kubernetes Cost Optimization Question

Zain Malik (Exostellar) and Nibir Bora (Clean Compute) will tackle the tough trade-offs between saving on cloud spend and keeping your Kubernetes workloads reliable at scale.

The interesting part, though, is that they’ve managed to flip the script. They went from 9% resource utilization to 50% while improving reliability.

When: Wednesday, November 12, 2025 | 2:15 pm – 2:45 pm EST

Where: Building B | Level 3 | B308-309

Add Zain and Nibir’s talk to your schedule
