Frontline Reliability: Protecting User Journeys with SLOs with Shery Brauner (Razor, ex-Zalando)

🛒

Ex-Zalando SRE leader

🔄

Frontend-to-SRE journey

👀

Observability advocate

🛤️

Focus on user journeys

Shery Brauner is an engineering leader at Razor and former SRE at Zalando, with a career spanning frontend, backend, and management. She’s known for championing observability and SLO-driven incident management, and for her sharp perspective on what most teams get wrong about reliability (“tooling is only half the equation”). Having built reliability practices at scale, she’s now helping redefine how organizations protect the user journey and give engineers their nights back.

1. Why Protecting the User Journey is the Core of Reliability

Shery emphasizes that reliability isn’t about isolated services: it’s about the end-to-end user journey. At Zalando, her team mapped out the handful of operations that truly sat at the “frontline” of the customer experience, like browsing, adding to cart, and placing an order.

Protecting these edge operations allowed them to catch symptoms early, before customers felt pain or the business started bleeding money. As she puts it: “If you protect that, you’ll be able to deliver a smooth operation and be proactive enough to not gain the experience when it’s too late.”

2. From Noisy Alerts to Meaningful Signals: The Power of SLOs

Traditional monitoring often leads to false positives and “flaky alerts” that wake engineers up at night without delivering actionable insight. Shery recalls a scenario where one engineer was paged 20 times, muted the alerts, and then missed a real SEV1 incident.

The shift to SLO-based alerting changed everything: instead of predicting every possible failure, teams focus on protecting critical operations with clear thresholds. “Your reaction time goes under one minute because you can see the trend, predict it, and fire the alert when it actually matters.”
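One common way to implement this kind of alerting is a burn-rate check: compare how fast the error budget is being consumed against a threshold, over both a long and a short window. The sketch below is illustrative only and is not taken from the interview. The names `Window`, `burn_rate`, and `should_page`, the 99.9% target, and the 14.4 threshold (a widely used starting point for a one-hour window on a 30-day SLO) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one critical operation over a time window (hypothetical shape)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """How many times faster than 'sustainable' the error budget is being spent."""
    if window.total == 0:
        return 0.0  # no traffic, nothing to burn
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (window.errors / window.total) / error_budget


def should_page(long_w: Window, short_w: Window,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast: the long window shows the problem
    is significant, the short window shows it is still happening."""
    return (burn_rate(long_w, slo_target) >= threshold
            and burn_rate(short_w, slo_target) >= threshold)


# Sustained failure on "place order": 2% errors over the hour, 3% in the last 5 minutes.
print(should_page(Window(10_000, 200), Window(1_000, 30)))
# Brief blip: the last 5 minutes look bad, but the hour as a whole is healthy.
print(should_page(Window(10_000, 20), Window(1_000, 30)))
```

Requiring both windows to exceed the threshold is what filters out the short blips that would otherwise become flaky pages, while still firing quickly on a sustained problem.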

3. Observability as the Foundation of Proactive Incident Management

For Shery, observability is the bedrock of reliability. Without deep visibility into how services behave, incident management will always be reactive.

She argues that teams should adopt a “tracing-first mindset,” ensuring they can follow dependencies back to the root cause. “At the point that the business is bleeding or the customer is in pain, you’re too late. Observability is what lets you stay ahead of that.”

4. Avoiding Pitfalls: Over-Instrumentation vs. High-Quality Signals

More data isn’t always better. Shery cautions against auto-instrumenting everything, which only creates noise and drowns out meaningful signals. Her approach is “crawl, walk, run”: annotate services deliberately, a step at a time, so that the telemetry they produce is high-quality and actionable. “If you’re not knowing what you do with the instrumentation of that service, you create so much noise… it becomes like finding the needle in the haystack.”

5. The Future of Incident Response: AI, Automation, and Reliability Culture

AI is already helping incident management by surfacing relevant playbooks, reducing cognitive load, and bridging knowledge gaps between teams. But Shery sees an even bigger opportunity: embedding observability as a requirement in CI/CD pipelines, so no service ships to production without a minimum level of visibility. This, she argues, would cement observability as a cultural must-have rather than an afterthought. “I really hope one day I see that observability becomes one of the must-haves for everybody in the mind.”
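As a rough illustration of what such a pipeline requirement could look like, the sketch below fails a build when a service does not declare a minimum set of observability signals. Everything here is hypothetical: the manifest layout, the `observability` key, and the `REQUIRED_SIGNALS` list are assumptions made for the example, not something described in the interview or tied to a specific CI system.

```python
from typing import List

# Hypothetical minimum bar a service must declare before it may ship.
REQUIRED_SIGNALS = ("metrics", "traces", "slo")


def missing_observability(manifest: dict) -> List[str]:
    """Return the required signals that the service manifest does not declare."""
    declared = manifest.get("observability", {})
    return [name for name in REQUIRED_SIGNALS if not declared.get(name)]


def gate(manifest: dict) -> int:
    """Return a process exit code: 0 lets the pipeline continue, 1 blocks the deploy."""
    missing = missing_observability(manifest)
    if missing:
        service = manifest.get("name", "<unnamed>")
        print(f"{service}: blocked, missing observability: {', '.join(missing)}")
        return 1
    return 0


# Example manifest (assumed shape) for a service with no SLO defined yet.
checkout = {"name": "checkout", "observability": {"metrics": True, "traces": True}}
exit_code = gate(checkout)
```

In a real pipeline the manifest would be loaded from the repository and the exit code passed to the CI runner. The point of the sketch is only that the check is cheap and mechanical once the minimum bar is written down.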