

March 19, 2025
10 mins
KubeCon doesn’t have an SRE track but we’ve gone through the 300+ sessions that’ll take place in London so you don’t have to.
KubeCon EU ’25 is taking over London in two weeks, boasting 229 sessions distributed across 22 tracks, plus 89 CNCF project maintainer talks. With so many sessions on offer, you need to plan your schedule. Going through that many entries takes time, but I’ve got you: I’ve picked the SRE-adjacent talks I think are most worth your time.
For this edition, I’ve divided my picks into four categories:
I’ll be at KubeCon London, so make sure to say hi! You can find me at the Rootly Booth (S780). Or, join one of the three Happy Hours we’re hosting:
Prashant Gupta and Kruthika Prasanna Simha (Apple) will explore how pre-trained, unsupervised ML models can be leveraged for real-time anomaly detection from the moment your system goes live for the first time.
Using cloud-native tools like Kubeflow for model fine-tuning, they’ll walk through practical techniques to monitor application health and detect anomalies without historical data.
When: Wednesday, April 2, 2025, 11:15 am - 11:45 am BST
Where: Level 1 | Hall Entrance N10 | Room E
Add Apple’s talk to your schedule
Querying observability data shouldn’t be a headache. Alolita Sharma (Apple), Pereira Braga (Google), and Chris Larsen (Netflix) will walk us through the Observability TAG’s work on unifying query languages across metrics, traces, profiles, events, and logs that can also support AI applications.
The talk covers the Observability TAG’s QLS proposal, a spec for a semantic query language. The speakers will dive into the design principles, trade-offs, and challenges of balancing simplicity, expressiveness, and performance.
When: Thursday, April 3, 2025, 11:00 am - 11:30 am BST
Where: Level 3 | ICC Capital Suite 10-12
Add the Observability TAG talk to your schedule
Now that OpenTelemetry’s database semantic conventions are stabilized, Marylia Gutierrez (Grafana Labs) will walk us through how to instrument applications with OpenTelemetry SDKs to collect actionable telemetry data from databases.
Marylia will dive into the available SDK implementations across languages and databases, current gaps, and how you can contribute to missing instrumentation.
When: Wednesday, April 2, 2025, 2:30 pm - 3:00 pm BST
Where: Level 1 | Hall Entrance N10 | Room E
Add Grafana Labs’ talk to your schedule
Alexa Griffith and Ankita Chaudhari (Bloomberg) will share how they’ve navigated the chaos of multi-cluster AI platforms using SLIs, SLOs, and observability dashboards.
From defining meaningful reliability metrics to designing actionable SLO dashboards, they’ll break down best practices for maintaining platform resilience across cloud, on-prem, and hybrid environments. Expect real-world lessons, battle-tested strategies, and practical takeaways to help ensure your AI workloads run smoothly—even at scale.
When: Wednesday, April 2, 2025, 11:15 am - 11:45 am BST
Where: Level 1 | Hall Entrance S10 | Room B
Add Bloomberg’s talk to your schedule
Guangya Liu (IBM) and Karthik Kalyanaraman (Langtrace AI) will explore how OpenTelemetry can be extended to monitor, trace, and analyze AI-driven architectures. From tracing inference workflows to correlating AI-specific data like model performance and decision latency, they’ll break down the complexities of observability in multi-agent systems.
Expect practical demonstrations, real-world use cases, and insights into how OpenTelemetry provides transparency, reliability, and optimization for AI-driven architectures running on Kubernetes.
When: Wednesday, April 2, 2025, 3:15 pm - 3:45 pm BST
Where: Level 1 | Hall Entrance N10 | Room E
Add IBM’s talk to your schedule
Keeping AI/ML models reliable in production is not easy. Celalettin Calis (Chronosphere) will dive into how OpenTelemetry and Fluent Bit can work together to create a powerful open-source observability stack tailored for AI/ML workloads.
You’ll learn how to log and debug models like GPT and BERT, track prompts and their results, and monitor agent performance in production.
When: Friday, April 4, 2025, 3:15 pm - 3:45 pm BST
Where: Level 0 | ICC Capital Hall | Room J
Add Chronosphere’s talk to your schedule
Vijay Samuel (Principal MTS, Architect, eBay) will share how eBay’s Observability Platform team built AI-powered "Explainers" for telemetry signals. Instead of just dumping data into an LLM and hoping for the best, they combined engineered algorithms with AI to create more predictable and accurate insights.
Vijay will explore how eBay tackled trace interpretation, critical path detection, and dashboard explanations by strategically combining AI and observability.
When: Wednesday, April 2, 2025, 9:33 am - 9:48 am BST
Where: Level 0 | ICC Auditorium
Add eBay’s Keynote to your schedule
Chris Leavoy (Etsy) and Bryan Boreham (Grafana Labs) will share their journey of scaling a single Prometheus instance to its absolute limits: running on a 128-core machine with 4TB of RAM and handling up to 500 million metrics.
This talk dives into the challenges of operating one of the industry’s largest Prometheus deployments, covering key lessons in performance tuning, diagnosing bottlenecks, and optimizing metrics volume for resilience.
When: Thursday, April 3, 2025, 11:45 am - 12:15 pm BST
Where: Level 1 | Hall Entrance N10 | Room G
Add Etsy’s talk to your schedule
Abu Kashem (Red Hat) and Stefan Schimanski (Upbound) will take you on a deep dive into the lifecycle of a Kubernetes API request—from its arrival at the API server to its response back to the caller. Using clear diagrams instead of code, they’ll break down the request flow, Kubernetes architecture, and the observability signals (logs, audits, metrics, and errors) that help diagnose issues.
When: Wednesday, April 2, 2025, 3:15 pm - 3:45 pm BST
Where: Level 1 | Hall Entrance S10 | Room C
Add this Kubernetes API talk to your schedule
Building a successful open source project takes more than just great code—it takes a thriving community. Reese Lee (New Relic) and Adriana Villela (Dynatrace) will share insights from their work as OpenTelemetry (OTel) maintainers, covering what it takes to raise awareness, drive adoption, and connect contributors with end users.
When: Thursday, April 3, 2025, 4:45 pm - 5:15 pm BST
Where: Level 1 | Hall Entrance N10 | Room H
Add this OTel Community talk to your schedule
Rootly will have a big presence at KubeCon London. Find us at our booth (S780). Once again, I’d love to catch up with you at one of our Happy Hours:
Get more features at half the cost of legacy tools.