


I’m so hyped for KubeCon NA! Next week, we’ll be in Atlanta reconnecting with friends, partners, customers, and swapping ideas with leaders and cloud-native experts.
It’s kind of a Rootly tradition now to put together the unofficial SRE track for KubeCon. With over 300 talks crammed into three days, there’s no shortage of great topics to dive into, but only so much time. After a heated lunch debate last week, we narrowed it down to the five must-see SRE-related sessions.
No AI talks this time around. We’re sparing you that, so you can browse fearlessly.
Personally, I only block off one or two talks a day, which leaves room to explore trends I haven’t discovered yet and catch sessions that pop up on the fly. Plus, I like having enough time to network and reconnect IRL with folks I only get to see at KubeCon.
Oh, and don’t miss our two Happy Hours:
Without further ado: our KubeCon SRE track!
Vasudev Bongale (LinkedIn) will share how the platform engineering team at LinkedIn manages over 50,000 application rollouts weekly (yes, weekly). At that scale, chaos is the first thing that comes to mind, but they’ve turned deployment into something predictably dull.
They built a system that tracks every change (from pipeline triggers to kubectl edits and controller-driven mutations), surfaces failure modes (like pod scheduling, init container crashes, image-pull delays), and wraps all of this into out-of-the-box health metrics that developers can actually understand.
When: Tuesday, November 11, 2025 | 2:30 pm – 3:00 pm EST
Where: Building B | Level 3 | B304-305
Add Vasudev’s talk to your schedule
If you’ve ever deployed an app that worked perfectly in staging but face-planted at startup in production, this one’s for you. Hiroshi Hayakawa (LY Corporation) digs into what he calls launch reliability, the often overlooked moment when your service spins up, scales, and either soars or stalls.
This talk will cover practical tuning for startup success: smarter resource allocation, health check optimization, automated warm-ups, and Kubernetes’ new in-place pod resize for CPU bursts. The goal isn’t just faster rollouts but ensuring every restart and deploy lifts off cleanly and stays in orbit.
When: Tuesday, November 11, 2025 | 5:45 pm – 6:15 pm EST
Where: Building B | Level 4 | B406b–407
Add Hiroshi’s talk to your schedule
In this panel, Whitney Lee (Datadog), Stevie Caldwell (Fairwinds), Danielle Cook, Khallai Taylor (E.ON Digital Technology GmbH) and Payal Bagga (Intuit) pull back the curtain on what “observability at scale” really means when you’re operating internal developer platforms.
They’ll cover everything from telemetry pipelines and OpenTelemetry adoption, to aligning service-level objectives with business goals, to embedding observability into everyday developer workflows.
When: Wednesday, November 12, 2025 | 2:15 pm – 2:45 pm EST
Where: Building B | Level 3 | exact room TBC
Add this panel to your schedule
In this panel by the CNCF Operational Resilience TAG, Rafael Brito (StormForge), Mario Fahlandt (Kubermatic), Saiyam Pathak (vCluster), Alolita Sharma (Apple), and Nabarun Pal (Broadcom) will lay out what resilience means when your infrastructure spans clouds, continents, and compliance regimes.
The group will share how the OpRes TAG is redefining reliability standards across observability, management, sustainability, and Day 2 operations, connecting the dots between business continuity, cost efficiency, and the health of the ecosystem itself.
When: Wednesday, November 12, 2025 | 5:30 pm – 6:00 pm EST
Where: Building B | Level 5 | Thomas Murphy Ballroom 2–3
Add this panel to your schedule
Zain Malik (Exostellar) and Nibir Bora (Clean Compute) will tackle the tough trade-offs between saving on cloud spend and keeping your Kubernetes workloads reliable at scale.
The interesting part, though, is that they’ve managed to flip the script. They went from 9% resource utilization to 50% while improving reliability.
When: Thursday, November 13, 2025 | 2:15 pm – 2:45 pm EST
Where: Building B | Level 3 | B308-309
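To put the 9% to 50% utilization jump in perspective, here is a back-of-the-envelope sketch. It assumes the workload stays constant and that utilization simply means used capacity divided by provisioned capacity; the 100-core workload and the function name are made up for illustration and are not figures from the talk.

```python
def capacity_needed(workload: float, utilization: float) -> float:
    """Provisioned capacity required to serve `workload` at a given average utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return workload / utilization


# Hypothetical workload: 100 cores of actual usage (not a number from the talk).
workload = 100.0

before = capacity_needed(workload, 0.09)  # provisioned cores at 9% utilization
after = capacity_needed(workload, 0.50)   # provisioned cores at 50% utilization
reduction = 1 - after / before            # share of provisioned capacity no longer needed

print(f"before: {before:.0f} cores, after: {after:.0f} cores, reduction: {reduction:.0%}")
# prints "before: 1111 cores, after: 200 cores, reduction: 82%"
```

Under those assumptions, the same work fits on roughly a fifth of the provisioned capacity, which is why doing it without hurting reliability is the interesting part.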