You Can’t Fix What You Don’t Measure: Observability in the Age of AI with Conor Bronsdon

💡

Writes a Substack

🎙️

Hosts Chain of Thoughts

🌍

Loves AI conferences

☕

Coffee > Code

Table of contents

Conor Bronsdon is the Head of Content and Developer Awareness at Galileo, where he helps teams build more reliable, transparent AI systems through better evaluation and observability.

1. Why Observability Is the Foundation of Reliable AI

There’s this old saying in the reliability world,, “you can only fix something if you can measure it.” Conor Bresson argues the adage still applies to AI, but it’s tricker. “If you’re not observing your AI systems,” he says, “you are definitely not setting yourself up for success.”

Without observability, even understanding why a model produces the wrong answer is impossible. “If you don’t have monitoring systems in place, you can’t even begin to truly unpack that because if you’re just throwing a question to an LLM, it’s kind of a black box.”

Conor’s bottom line: observability isn’t optional. It’s the foundation for reliability. Just as SREs can’t diagnose an outage without logs or metrics, AI teams can’t improve systems they don’t measure.

2. Redefining Metrics for AI Performance

Traditional monitoring was binary. You could look at a web app and understand the health by looking at the number of 200s, 400s, and 500s. But with AI workloads, the signals are far more complex: how do you know that the summaries generated today are better or worse than the ones yesterday?

Galileo’s evaluation framework includes metrics like context adherence, completeness, chunk attribution, and chunk utilization. They’re all designed to help developers understand not just whether an LLM is “working,” but how.

He also highlights new metrics for agentic systems: “Did the action the agent takes actually advance the goal that was set out by the user, or was this a sidebar?” It’s a shift away from yes/no correctness toward measuring progress, context, and quality of reasoning, a more human way of evaluating machine behavior.

3. Human Feedback and Guardrails: Still Essential

Even as machine learning advances, Conor insists that human oversight remains essential. “We can’t fully offload this to machines,” he says. “It’s great that you can generate a synthetic dataset, but it’s crucial for us to enable human feedback.”

That feedback comes in many forms: data labeling, SME reviews, or even lightweight few-shot examples that help the model extrapolate improvement across wider datasets. “We see that continuous learning through human feedback, that auto-tuning of evaluators and metrics, provides quite a bit of benefit,” he says.

Conor also warns about skipping safety measures in production. “If we’re not actually effectively stopping bad actions in production, then we have a major gap.” Especially in regulated industries like banking, he says, multiple guardrails are essential: “You should have an input guardrail to detect prompt injections, and another on the other end to check whether the response is leaking personal data or doing something wrong.”

4. From Test-Driven to Evaluation-Driven Development

For deterministic systems, test-driven development (TDD) was the standard. For AI, Conor says it’s time for something new: evaluation-driven development (EvDD). “It’s the methodology of building for AI and LLM-based applications that prioritizes continuous evaluation of responses and everything happening in the system,” he explains.

EvDD replaces static unit tests with continuous metrics that guide iteration. “It involves defining success criteria, creating a set of evaluations to measure performance, and using those results to iterate.” Galileo customers like Airbnb, he adds, are already building with this approach.

Central to EvDD is the concept of LLM-as-a-judge: using one model to evaluate another. “You can have a model generate responses and another model spot-check them for accuracy, hallucination, or completeness,” he says. “You can even have a panel of judges (OpenAI, Claude, open-source Llama) and weight their scores differently.” It’s a meta-layer of accountability that helps AI systems learn faster, safer, and with measurable reliability.

5. The Rise of Small Language Models and the Future of Reliability

As AI systems scale, latency and cost can balloon. “There are huge latency challenges, accuracy challenges, cost challenges,” Conor admits. “If you look at a simple RAG chatbot, it’s really easy for that cost to balloon over time.”

Galileo’s answer is small language models (SLMs): compact, fine-tuned systems that can act as real-time evaluators and guardrails. “We developed our Luna 2 family of small language models,” he explains. “They’re lower latency, more cost effective, and designed to work in production as judges.”

He likens these models to “trained subject matter experts who are on call for you.” When combined with human spot checks and continuous learning, they enable scalable reliability. “If you want to scale,” Conor concludes, “you need basic observability and evaluations in your application from the beginning. You can’t fix what you don’t measure.”