
Are AI and Platforms Making SRE Obsolete? With Kaspar von Grünberg, Humanitec’s CEO

📝 Newsletter author · 🐻 Based in Berlin · 🇧🇷 Brazil Platforms Fan · 👶 Platform Dad

Listen on Spotify and Apple Podcasts!


Kaspar von Grünberg is the CEO of Humanitec and a leading voice in the platform engineering space. He’s known for advocating golden paths and for his hot takes (several in this episode, e.g. “LLM-based automation is honestly, bullshit.”). As founder of PlatformCon, he’s helping define what modern developer platforms should look like.

The Evolving Role of SRE

For Kaspar, the role of the SRE has not really evolved over the years. At most companies, SREs remain highly skilled engineers who understand the systems deeply and are called in when terrible things happen.

However, since Google’s conception of Site Reliability Engineering in 2004, software development has radically changed. That’s why Kaspar sees the role becoming obsolete, or rather, in need of a transformation. This is a realization that Google came to as well. At the end of last year, Google SREs talked about their new SRE model around STAMP for the first time, but its applicability for the average company remains to be seen.

For Kaspar, the problem with the traditional SRE role is that it is built around a traditional way of building software, which he compares to “little artisanal workshops.” So far, SREs have been cleaning up after many artisanal software teams. But now, with the introduction of platform engineering, the responsibilities of SREs are shifting.

Digital Factories and the Industrialization of Software Development

Kaspar sees the change in software production by comparing it to the evolution of automotive development. From 1800 to 1906, you had a single person building an entire car on their own, one at a time. But then, in 1907, Henry Ford introduced the production line.

For Kaspar, it is clear the IT industry of application development is at this same turning point. The expectation will be that teams should be able to churn out features and applications at an unparalleled speed. “That is only possible if we rely on heavy abstraction, on heavy automation, and on what I would refer to as digital factories,” explains Kaspar.

Reliability as Part of the Golden Path

Very often, incident management is an afterthought. Developers think of observability as something they will have to retrofit once their software is already running in production. Kaspar puts it bluntly, “that’s not a golden path. That’s just… terrible.”

The key to golden paths is to make it very easy for individual users to self-serve something, and, if they stay on that path, to offer them certain guarantees and defaults that they get for free for following it.

Examples of these guarantees could include incident management. Kaspar explains that observability can be integrated by default—by making sure you have the right sidecars, by making sure you’re collecting the right stuff, by making really sure you have this ingrained.
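The idea can be sketched in a few lines: a developer submits a minimal request, and the platform merges in reliability defaults (observability sidecar, incident routing) that the team gets for free by staying on the path. All names below are illustrative assumptions, not a real Humanitec or Rootly API.

```python
# Hypothetical golden-path defaults a platform might guarantee.
# Keys and values are made up for illustration.
GOLDEN_PATH_DEFAULTS = {
    "sidecars": ["otel-collector"],        # telemetry collected by default
    "metrics_scrape": True,                 # scraping enabled out of the box
    "incident_routing": "platform-oncall",  # incidents routed automatically
}

def apply_golden_path(request: dict) -> dict:
    """Merge a developer's minimal request with platform defaults.

    Developer-supplied keys win, so a team can still deviate
    deliberately -- but staying on the path costs nothing.
    """
    return {**GOLDEN_PATH_DEFAULTS, **request}

spec = apply_golden_path({"name": "checkout", "image": "checkout:1.4"})
print(spec["sidecars"])  # ['otel-collector']
```

The point of the sketch is the merge order: defaults are ingrained unless a team explicitly opts out, which is exactly the opposite of retrofitting observability after the fact.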

“I’m a big believer that the platform interfaces should be designed in a way that they’re meeting the users where the user is,” argues Kaspar. For developers, those interfaces may be found in Visual Studio Code, in Slack, in Teams, in the CLI, in the pipeline, in an LLM chatbot.

AI and LLMs in Developer Experience

The industry is embracing an “agent-first” mindset, where developers can interface with infrastructure using natural language prompts across familiar tools like Slack, CLI, or even LLM chatbots. Kaspar envisions a near future where developers say things like, “I want a Node.js service, with an S3 bucket and DNS, and I want it running in dev, staging, and production”—and the platform translates that request into a fully provisioned, standards-compliant environment. “That abstract request should make its way to some sort of backend… figure out the defaults… provision everything: Terraform, DNS, CI with GitHub Actions, manifests for EKS, sidecars, Rootly config, the whole thing.”

LLMs can act as a conversational interface to deterministic backends. “This will come 100%. This is something I’m super interested in. We have the deterministic backend APIs to pull this off.” However, Kaspar is quick to point out that LLMs should not be responsible for execution. “LLMs are a decent technology to select the next best step… proposing the next five lines of code… That kind of tap optimization with human in the loop.” AI, in this context, is not about replacing developers or platform engineers, but about abstracting complexity and enabling faster, safer adoption of golden paths through prompt-driven interactions.
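Kaspar’s split between a conversational front end and a deterministic backend can be sketched as follows. The LLM only *proposes* a structured intent; deterministic code validates it against a catalog and produces a plan for a human to approve before anything is provisioned. The catalog entries and intent shape are assumptions made up for illustration.

```python
# Deterministic backend: the only resources and environments the
# platform knows how to provision (illustrative values).
ALLOWED_RESOURCES = {"s3-bucket", "dns", "postgres"}
ALLOWED_ENVS = {"dev", "staging", "production"}

def validate_intent(intent: dict) -> list[str]:
    """Deterministically turn an LLM-proposed intent into a plan,
    rejecting anything outside the platform catalog."""
    plan = []
    for env in intent["environments"]:
        if env not in ALLOWED_ENVS:
            raise ValueError(f"unknown environment: {env}")
        for res in intent["resources"]:
            if res not in ALLOWED_RESOURCES:
                raise ValueError(f"unknown resource: {res}")
            plan.append(f"provision {res} in {env}")
    return plan

# Pretend the LLM parsed "a Node.js service with an S3 bucket and DNS,
# running in dev, staging, and production" into this structure:
intent = {
    "service": "node-service",
    "resources": ["s3-bucket", "dns"],
    "environments": ["dev", "staging", "production"],
}

plan = validate_intent(intent)
# A human reviews `plan` before the backend executes each step.
```

The LLM’s non-determinism is confined to producing the intent; everything after that point is ordinary, testable code, which is what makes the human-in-the-loop review meaningful.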

The Limits of LLMs for Reliability and Automation

For Kaspar, LLMs in the dev tooling ecosystem project a misleading image. “They’re making it feel as if it’s okay if things sort of work, but in our world—especially from a reliability perspective—that’s just not possible.”

Kaspar warns against overhyping the idea of AI agents that provision infrastructure or manage access autonomously: “These approaches and ideas that say, ‘we can have an agent that is LLM-based go off and do access management’—all of that is, in my opinion, honestly, bullshit.” The key issue is trust: “I don’t see any of the current approaches on the market being able to solve for that mandatory boundary condition—that the outcome has to be deterministic.” Reliability demands precision, and the non-deterministic nature of LLMs makes them fundamentally unfit for executing production-level workflows.

Even highly appealing use cases like generating Terraform on the fly come with serious risks. “Auto-generating infrastructure-as-code files, Terraform, etc.—terribly appealing. But the time it takes to make sure there’s no mistake… and the inaccuracy of policy management systems… make it hard for me to believe how we can do this without very, very present human supervision.” This constraint also helps explain why productivity gains from tools like GitHub Copilot don’t always scale. “At the individual level, this can help boost productivity… but at the engineering team level, they don’t see any improvement.” As Kaspar puts it, “Maybe writing code is not the bottleneck. Maybe it takes more time and energy to ensure that what’s been outputted is actually correct and reliable.”
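One way to make the “very, very present human supervision” concrete is a deterministic policy gate: every LLM-generated config is checked against hard rules, and any violation blocks the change until a human signs off. The rule names and config keys below are made up for illustration.

```python
# Hypothetical policy checks a platform might run on generated
# infrastructure config before allowing an apply.
POLICIES = {
    "no_public_buckets": lambda cfg: not cfg.get("public", False),
    "encryption_required": lambda cfg: cfg.get("encrypted", False),
}

def policy_violations(generated_config: dict) -> list[str]:
    """Return the names of all policies the config violates."""
    return [name for name, check in POLICIES.items()
            if not check(generated_config)]

# Suppose an LLM auto-generated this bucket definition:
candidate = {"name": "logs-bucket", "public": True, "encrypted": False}

violations = policy_violations(candidate)
# Non-empty -> block the apply and hand the diff to a human reviewer.
print(violations)  # ['no_public_buckets', 'encryption_required']
```

The gate itself is deterministic, which is the property Kaspar insists on: the LLM may propose anything, but only config that passes fixed, auditable checks ever reaches a reviewer, let alone production.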