The self-healing systems that SREs have dreamed about for a decade aren't a distant promise anymore — they're already being built, and the biggest barrier left is cultural.
Dana Lawson, CTO at Netlify, has spent over 25 years in the trenches of developer infrastructure, from sysadmin roots to running the platform that powers 5% of the internet.
In this episode, Dana makes the case that the SRE/DevOps community's resistance to AI agents isn't really about the technology — it's about identity, control, and the fear of losing the expertise that defined a generation of engineers. She doesn't dismiss that fear.
She shares a pragmatic playbook for moving through it: start with the low-hanging fruit (flaky tests, dependency upgrades, vulnerability patching), apply spec-driven development to manage non-determinism, and lead your team with empathy and a beginner's mindset. The conversation covers AI's impact on the open source vs. buy equation, what "vibe coding Fridays" looks like at Netlify, and why Dana believes leaders have a responsibility — not just an opportunity — to bring their teams into the AI era.
Are self-healing systems still a dream, or are we actually there?
We're on the precipice. I drank the Google SRE Kool-Aid like every good DevOps engineer should — that Nirvana landscape of error budgets and self-healing machines. Back then, it was still mostly talented people applying scripting and automation to approximate that behavior. Today, the dream is actually becoming reality. There are still real reservations about letting LLMs loose on production systems to do autonomous remediation, but that future started accelerating hard in the latter part of 2024. We're here now.
Why does the SRE community push back so hard on AI agents?
I get there too sometimes, honestly. I started before we had cool titles like DevOps or SRE — we were just sysadmins and network admins. And over the last decade, there's been this deep identity shift around reliability. It went from "go automate stuff and build at scale" to "you are the one who holds reliability as a top-tier feature." That's a real professional identity. So when you tell someone "now there's an AI SRE, and it's going to be in the room instead of you," it stings. Especially when you're being asked to trust something that makes decisions with much looser boundaries than the deterministic automation you understood.
The reluctance is cultural at its core. But I remind people: we evolved. We evolved from FTPing files around to doing it securely. We evolved from manual ops to infrastructure as code. This is the next step in that evolution — and it doesn't take away your importance as the person who holds reliability as a core principle.
Is the fear of job loss legitimate?
Oh yeah. And I think a lot of people are lying about this right now — maybe not intentionally, but the fear is real. We're not seeing the mass wave of AI-created jobs yet. And the question of whether those will be skilled or unskilled positions matters a lot to people in our field.
Here's what I tell people: every major technology shift has displaced something. The printing press displaced the scribe. Smartphones displaced dumb phones. This is the same arc. There will be disruption and displacement. But the response to that can't be paralysis — it has to be curiosity. You've built expertise over a career. You have something LLMs are modeled after. Use it. Go apply that now. Look at coding agents as an extension of yourself. That's how you maintain job security — but you have to fight for it. And if anyone tells you otherwise, I don't think that's honest.
AI won't take your job — but someone using AI might. What does that mean practically?
I apply what I call the 80/10/10 rule. Ten percent of people will say AI is the dumbest thing they've ever seen. They're wrong. Ten percent will say they can't live without it and it's changed everything. They're also wrong. The other 80% of us are based in reality — and that reality is: these tools are here, you should have some skepticism, but you shouldn't be allergic to them.
The engineers who are using AI tooling at every stage of their development flywheel — from CI/CD to co-development — are producing measurably more value. We can prove that with data. And that bar is going to keep rising. I don't care how technically talented you are; at the end of the day, it's about how much value you're adding to your customers and your product. If that value can be measured and accelerated through these tools, it's simple workplace dynamics.
Where should skeptical teams actually start?
Start with the low-hanging fruit. Stuff that's repeatable, has a solid data baseline, and won't hurt anything if it goes sideways. At Netlify, we have a monolith — like a lot of companies — and monolith maintenance can be brutal. Dependency revisions, OS upgrades, library updates. When an open source maintainer makes a critical change and you have to refactor a ton of code, that used to be a slog. Now I can send an AI agent in to make sure nothing in that release is going to break our systems. It's usually tedious, stupid work. Nobody wants to do it. That's exactly where you start.
Flaky tests are another one. Everyone has them. Everyone ignores them. And then quality slowly degrades. This is exactly the kind of repeatable, bounded task where AI thrives — and where the risk of giving it more autonomy is low. System patches, vulnerabilities, revisions, upgrades — all of it. Move from "more automation" toward "self-diagnosing and patching." That's still very much in the spirit of self-healing systems, just more accessible than it's ever been.
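Before handing flaky tests to an agent, it helps to detect them mechanically: rerun each test several times against an unchanged commit and flag any that produce mixed results. A minimal sketch of that triage step — `run_test` here is a stand-in for invoking a real test case, not an actual test runner:

```python
import random

def run_test(trial: int) -> bool:
    """Stand-in for running one test case; a real harness would shell
    out to the test runner instead. Seeded per trial so the demo is
    reproducible while still behaving like a flaky test."""
    random.seed(trial)
    return random.random() > 0.3  # passes most of the time: flaky

def classify(test_fn, reruns: int = 20) -> str:
    """Rerun a test N times on the same commit.
    Mixed results mean flaky; uniform results mean stable."""
    results = {test_fn(i) for i in range(reruns)}
    if results == {True}:
        return "stable-pass"
    if results == {False}:
        return "stable-fail"
    return "flaky"  # candidate for quarantine or an automated fix

print(classify(run_test))  # mixed pass/fail across reruns -> flaky
```

Tests classified as "flaky" are exactly the bounded, low-risk work items where an agent can be given more autonomy without endangering production.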
How is AI changing the open source vs. commercial SaaS decision for infrastructure teams?
I'm biased here — I love open source. I love that it said "you don't need to work at a company to have a good idea and build something." But I think AI fundamentally changes the build vs. buy calculus. When I started in the 90s, there were real barriers to building: you needed to know the language, know CS fundamentals, have access to books and communities. Those barriers have evaporated.
Now there are brilliant people with ideas who just couldn't manifest them technically — and AI is handing them the tools to do it. So I think we'll see a lot more teams building internal tools they would have previously bought, especially for narrower use cases. They'll probably still build on top of open source, but the need to buy a commercial product for every workflow is shrinking. Think about how we built early alerting. We just had Nagios and we built dashboards off it. We're almost coming full circle — except now the tooling to do it yourself is dramatically more accessible.
Non-determinism in LLMs is a big concern for reliability engineers. How do you manage it?
It freaks me out too. If I spin up four sessions of Claude and ask the same thing, am I getting the same output every time? Probably two out of four. And then I have to ask myself: does it matter? The architect in me says it absolutely does. The vibe coder in me says YOLO, it works.
The honest answer is there's a time and place. For systems that need determinism, use spec-driven development. Write your context files like they're carved in stone. You can constrain what the LLM can and cannot do, and that gives you back meaningful control without eliminating the benefits. For parts of your workflow where experimentation is actually good, loosen up. The last mile — before anything goes to production — is where you reinsert the human. You can't always control the output, but you can control the input. And that's where DevOps professionals still have enormous value: bless it or veto it before it ships.
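One way to picture "control the input, gate the output" is to treat the spec as a deterministic validation step that any LLM-generated change must pass before a human reviews it. A minimal sketch, assuming the agent's proposal arrives as a structured dict — the spec fields here are illustrative, not a real Netlify schema:

```python
# Hypothetical spec: hard limits an agent's proposed change must satisfy.
SPEC = {
    "targets": {"staging", "production"},  # allowed deploy targets
    "max_files_changed": 10,               # blast-radius limit
    "requires_tests": True,                # code must ship with tests
}

def passes_spec(change: dict) -> tuple[bool, list[str]]:
    """Deterministic gate over non-deterministic output: whatever the
    model produced, it either satisfies the spec or it doesn't."""
    violations = []
    if change.get("target") not in SPEC["targets"]:
        violations.append(f"disallowed target: {change.get('target')}")
    if change.get("files_changed", 0) > SPEC["max_files_changed"]:
        violations.append("change set too large")
    if SPEC["requires_tests"] and not change.get("includes_tests"):
        violations.append("no tests included")
    return (not violations, violations)

# The last mile: only spec-clean changes reach human review.
ok, why = passes_spec(
    {"target": "staging", "files_changed": 3, "includes_tests": True}
)
print(ok)  # True
```

The model can vary across sessions, but the gate cannot — which is what makes the human's final bless-or-veto step tractable.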
We freaked out about spot instances too, remember? Stateless, ephemeral, non-deterministic. And it turned out to be fine. We found our best practices, we built confidence in the system. Same arc here. Stay paranoid where it matters, but don't let that paranoia stop you from building.
What's your advice for leaders trying to bring their teams along?
Start with self-reflection. It's really hard to stand up in front of your team and say "here's where we're going" when you don't actually believe it yourself. So do yourself the favor first — educate yourself, sit with your own skepticism, find what feels real for you. Be authentic about where you are on this ride before you ask your team to follow.
Then bring that same energy to your team, one-on-one where possible. Survey people. Get a sense of where their comfort levels actually are. Use your frontline managers. Be direct: where are you on the spectrum, and how can I enable you? Also take a step back and look at your organizational maturity. Maybe you're not ready to instrument AI across your systems yet, and that's okay — but you need to start charting that journey.
One thing we do at Netlify: we have a dedicated hour every week where anyone in the company — any department — can come hang out, vibe code, talk about tools, share what they're reading. The goal is just to take the fear out of it. Make it human and accessible. Because at the end of the day, if you believe AI is going to disrupt people's careers — and I do — then it's your responsibility as a leader to give people the space, the time, and the support to level up. That's always been what leadership is for.