99%+ Accuracy on a Moving Target: Model Deprecation and Reliability with Tomás Hernando Koffman (Not Diamond)

Tomás Hernando Koffman
🎯
Chases 99%+ accuracy
🧱
Thinks prompts are architecture
🔄
Battles model drift daily
📊
Builds evals before features

Listen on Spotify and Apple Podcasts!

Building on LLMs means accepting a new kind of instability: non-deterministic behavior, silent model updates, and rapid deprecations. In this episode, Tomás Hernando Koffman (Co-founder of Not Diamond) explains how teams can still reach 99%+ accuracy by applying an SRE mindset to LLM systems: treating prompts, evaluations, and workflows as first-class reliability components.

Key Topics Discussed

  • Why model churn and deprecations are a core reliability risk for LLM-based systems
  • What “good enough” accuracy really means — and when 80% stops being acceptable
  • Applying SRE principles to LLM applications: evaluations, golden datasets, and metrics
  • Prompts and instructions as part of the system architecture, not just inputs
  • Why manual prompt tuning doesn’t scale — and how prompt optimization changes the game
  • Designing workflows that stay stable even as the underlying model evolves

Why is reliability harder with LLMs than traditional software?

The core issue is that LLMs are non-deterministic by nature. Even if you give them the same input, you can still get different outputs. That already makes things harder than classical software. But then you add model updates, silent changes, and deprecations, and suddenly the behavior of your system can shift without you changing anything in your code.

When you put LLMs into workflows, those small variations compound very quickly. A tiny drop in accuracy at one step can cascade into much larger failures downstream, especially when the system is used at scale.
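
To put a number on that compounding, here is a tiny back-of-the-envelope sketch in Python (the step count and per-step accuracies are illustrative, not figures from the episode):

```python
# Illustrative only: end-to-end accuracy of a linear workflow where each
# step must succeed, assuming independent steps with equal accuracy.
def end_to_end_accuracy(per_step_accuracy: float, num_steps: int) -> float:
    return per_step_accuracy ** num_steps

print(end_to_end_accuracy(0.95, 5))  # ~0.77 -- five "95% accurate" steps
print(end_to_end_accuracy(0.99, 5))  # ~0.95 -- the same pipeline at 99% per step
```

Even a few points lost at each step take the whole pipeline from roughly 95% down to roughly 77%.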

What does “good enough” accuracy actually mean in production?

It really depends on the use case. For things like creative writing, brainstorming, or internal tooling, 80% accuracy might be totally fine. But the moment you’re building something user-facing, or something that feeds into another automated system, the bar changes dramatically.

In many real production systems, you actually need something closer to 95% or even 99% accuracy. Below that, the operational overhead becomes too high. Humans need to step in too often, trust erodes, and the system stops being useful.

Why do model updates and deprecations create reliability debt?

The problem is that models change underneath you. Providers update them, improve them, or deprecate them entirely, and those changes can be subtle. Sometimes accuracy improves, sometimes it gets worse — but either way, your evaluations can suddenly be invalid.

That means teams are constantly paying a reliability tax. You’re re-validating behavior, re-tuning prompts, and re-checking workflows just to stay at the same level of performance. Over time, that becomes a form of technical debt that’s specific to LLM systems.

How should teams think about prompts and instructions?

Prompts aren’t just text — they’re part of the system architecture. You have system prompts, user inputs, dynamic context, instructions, guardrails, tools, and sometimes multiple models interacting with each other.

If you treat prompts as something you just “hack” until it works, you’ll hit a ceiling very fast. Once systems get complex, you need structure, separation of concerns, and clear interfaces — the same things we’ve learned to apply in traditional software engineering.
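
As a rough illustration of that separation of concerns, the sketch below splits a prompt into separately owned pieces with one explicit assembly point. It is a generic pattern, not Not Diamond's code, and the field names are made up:

```python
from dataclasses import dataclass

@dataclass
class PromptSpec:
    """A prompt treated as a structured component, not a hand-edited string."""
    system: str        # stable role and policy, changes rarely
    instructions: str  # task-specific rules, versioned and reviewed like code
    guardrails: str    # constraints appended to every request

    def render(self, context: str, user_input: str) -> list[dict]:
        # The single place where the pieces are assembled into messages,
        # so each part can be tested and optimized on its own.
        return [
            {"role": "system", "content": f"{self.system}\n\n{self.guardrails}"},
            {"role": "user",
             "content": f"{self.instructions}\n\nContext:\n{context}\n\n{user_input}"},
        ]
```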

What does an SRE-style framework for LLM reliability look like?

It starts with evaluations. You need golden datasets and metrics that actually reflect whether the system is doing the right thing. Accuracy isn’t always a simple number — sometimes you need semantic or task-specific evaluations.
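
A minimal version of that evaluation step might look like the sketch below, assuming you already have a golden dataset of input/expected pairs and a task-specific scoring function; `run_model` and `score` are placeholders, not a particular SDK:

```python
from typing import Callable

def evaluate(golden_set: list[dict],
             run_model: Callable[[str], str],
             score: Callable[[str, str], float],
             threshold: float = 0.99) -> bool:
    """Run the golden set through the system and gate on a target accuracy."""
    scores = [score(run_model(case["input"]), case["expected"])
              for case in golden_set]
    accuracy = sum(scores) / len(scores)
    print(f"accuracy={accuracy:.3f} over {len(golden_set)} cases")
    return accuracy >= threshold
```

The same harness can be re-run whenever a provider updates or deprecates a model, which is what keeps the reliability tax from the previous section bounded.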

From there, you design workflows that are resilient. You control context carefully, reduce unnecessary variability, and think about failure modes upfront. The model is just one component in a larger system, and you need to engineer around its weaknesses.
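
One generic way to engineer around those weaknesses is to wrap the model call with validation, retries, and a fallback model; the `call` and `validate` helpers below are placeholders for whatever client and checks you use, not a specific API:

```python
def call_with_fallback(prompt: str, models: list[str], call, validate,
                       max_retries: int = 2) -> str:
    """Try each model in order; retry invalid outputs before falling back."""
    for model in models:
        for _ in range(max_retries):
            output = call(model, prompt)  # e.g. low temperature to reduce variability
            if validate(output):          # schema / format / semantic check
                return output
    raise RuntimeError("all models and retries exhausted")
```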

Why doesn’t manual prompt tuning scale?

Manual prompt tuning works at the beginning, when you’re experimenting. But as soon as you need high accuracy across many tasks, it breaks down. You’re only exploring a tiny fraction of the possible prompt space, and improvements become incremental at best.

This is where automated prompt optimization comes in. Instead of guessing, you systematically explore variations and evaluate them against real metrics. That’s how you get large jumps in accuracy, sometimes even close to 100% on specific tasks.
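
In its simplest form, that search is a loop that proposes prompt variants and keeps whichever scores best on the golden set. The sketch below is a toy hill-climber; `propose_variants` and `score_prompt` are assumed helpers, and real optimizers are considerably more sophisticated:

```python
def optimize_prompt(base_prompt: str, golden_set: list[dict],
                    propose_variants, score_prompt, rounds: int = 10) -> str:
    """Greedy search over prompt variants, scored against a golden dataset."""
    best_prompt = base_prompt
    best_score = score_prompt(best_prompt, golden_set)
    for _ in range(rounds):
        for candidate in propose_variants(best_prompt):
            candidate_score = score_prompt(candidate, golden_set)
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
    return best_prompt
```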

What does the future of prompt optimization look like?

Prompt optimization is becoming its own discipline. As models continue to change, the ability to automatically adapt prompts and workflows will be essential. Otherwise, every model update becomes a fire drill.

Long term, teams that treat optimization, evaluation, and monitoring as first-class concerns will be the ones that can actually trust their LLM systems in production.

Where to Find Tomás