
October 30, 2025

6 mins

When Nothing Changes and Everything Breaks: Why Machine Learning Fails Differently

Why 50% of companies don't monitor ML and how it’s reshaping our understanding of reliability.

Written by Jorge Lainfiesta

During the pandemic, models across industries started failing overnight: forecasts went wild, recommendations broke. Machine learning teams couldn’t figure out what had happened. The code hadn’t changed, and every check on every dashboard was green. What had changed was the data being fed to the models. Nobody had seen anything like it before, so the models degraded without triggering a single alert.

Most system failures announce themselves loudly: alerts go off, dashboards spike, and logs light up with red. Machine learning systems are different. They fail quietly, sometimes for weeks, as happened to Anthropic this year.

We recently sat down with Maria Vechtomova, one of the most influential voices in MLOps, to talk about why machine learning systems fail in ways traditional systems never do. As she put it, sometimes nothing changes and everything breaks.

MLOps makes models production-ready

MLOps is, at its core, a set of practices designed to bring machine learning models into production reliably and efficiently. In theory, anything that runs on a schedule and delivers business value could be called “production.” You could, for example, schedule a Jupyter notebook to run daily and deliver results to end users. But that isn’t MLOps best practice, because it isn’t reliable.

“You can technically make anything production, right? Even scheduling a notebook that runs daily. But that’s not what we consider best practice, because it’s not reliable.”

If someone changes code in that notebook without version control, the results may change and no one will know why. Even when the code and infrastructure are exactly the same, the data distribution might shift, and the model could start producing very different results. “That’s why traceability and reproducibility,” Maria argues, “are probably the biggest principles of MLOps.”
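As a rough illustration of what traceability can look like, here is a minimal sketch in Python that records the git commit and a hash of the input data alongside each run, so that two runs that disagree can at least be explained. The file paths and function names are hypothetical; in practice most teams reach for an experiment tracker such as MLflow or Weights & Biases rather than a hand-rolled log.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def data_fingerprint(path: str) -> str:
    """Hash the raw input file so a run can be tied to the exact data it saw."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def log_run(data_path: str, metrics: dict, log_file: str = "runs.jsonl") -> None:
    """Append one traceability record per training run (illustrative only)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Which code produced this run.
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
        # Which data produced this run.
        "data_sha256": data_fingerprint(data_path),
        "metrics": metrics,
    }
    with open(log_file, "a") as f:
        f.write(json.dumps(record) + "\n")


# Example: log_run("data/train.csv", {"rmse": 0.42})
```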

Same Code ≠ Same Result

Traditional software engineering operates on the comforting assumption that if you run the same code on the same infrastructure, you get predictable output. If your code, data, and environment don’t change, you expect the same results.

Machine learning doesn’t work like that. In an ML system, the statistical properties of the data can shift, and suddenly your model no longer performs.

“We saw that during COVID. Everything broke. Not because of bad code, but because the world changed and the data no longer looked like what the model had seen before.”

During COVID, demand forecasting, recommender systems, and predictive analytics all started failing, inexplicably: not because of bad code, but because the patterns of human behavior that models relied on had shifted overnight.

The Monitoring Gap

Despite years of progress, MLOps tooling is still far from mature, and monitoring ML systems is just hard. The complexity goes far beyond traditional reliability metrics: uptime and latency still matter, but they tell you nothing about whether a model is still making sense.

“We are nowhere close to maturity. A survey by the Ethical Institute of AI showed that 50% of companies don’t have monitoring in place for their ML applications.”

The problem, Maria explains, is that no one fully agrees on what to monitor. “You can monitor standard properties of software systems, but on top of that you need to monitor data drift, model drift, and all those aspects. Even if there is data drift, it may still not mean anything for you.”
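To make the data drift part concrete, here is a minimal sketch that compares a training-time sample of a numeric feature against a recent production sample using a two-sample Kolmogorov–Smirnov test from SciPy. The threshold, the choice of test, and the synthetic data are all assumptions for illustration; and, as Maria points out, a statistically detectable drift may still not matter for your particular use case.

```python
import numpy as np
from scipy import stats


def check_feature_drift(reference: np.ndarray, current: np.ndarray,
                        p_threshold: float = 0.01) -> dict:
    """Two-sample KS test comparing training-time data with recent production data."""
    statistic, p_value = stats.ks_2samp(reference, current)
    return {
        "ks_statistic": float(statistic),
        "p_value": float(p_value),
        # A low p-value means the two samples are unlikely to come
        # from the same distribution.
        "drift_detected": p_value < p_threshold,
    }


# Illustrative usage with synthetic data: the production sample has shifted.
rng = np.random.default_rng(0)
train_sample = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod_sample = rng.normal(loc=0.6, scale=1.0, size=5_000)
print(check_feature_drift(train_sample, prod_sample))
```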

Some tools exist for simpler use cases like regression or classification, and more are emerging for large language models. But for complex applications, like recommender systems, decision engines, or domain-specific AI, most teams still have to build custom solutions from scratch.

When “200 OK” Isn’t Okay

In web applications, a “200 OK” response is comforting proof that everything’s working as intended. Machine learning, on the other hand, doesn’t care about your status codes. The system might be online, the API might be returning predictions, but the results themselves could be completely off.

“In machine learning,” Maria says, “you might get a ‘200 OK,’ but your model’s results could still be completely wrong. You don’t know how bad it is until you have the ground truth.” Only after comparing predictions to real-world outcomes do you see how your model is actually performing. Until then, all you have are estimations and heuristics, educated guesses at best. “You often don’t know your system is failing until much later,” she concludes.
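The “wait for ground truth” loop can be as simple as joining the predictions you logged at serving time with the outcomes that arrive later, then alerting when the metric degrades. A minimal sketch, assuming a pandas DataFrame of logged predictions and another of delayed labels keyed by the same ID; every column and threshold here is hypothetical.

```python
import pandas as pd


def evaluate_when_labels_arrive(predictions: pd.DataFrame,
                                outcomes: pd.DataFrame,
                                min_accuracy: float = 0.85) -> dict:
    """Join predictions logged at serving time with ground truth that arrives later."""
    joined = predictions.merge(outcomes, on="request_id", how="inner")
    accuracy = (joined["predicted_label"] == joined["true_label"]).mean()
    return {
        "labelled_fraction": len(joined) / max(len(predictions), 1),
        "accuracy": float(accuracy),
        # The API returned 200 OK for every one of these requests;
        # only this comparison tells you whether the answers were any good.
        "alert": accuracy < min_accuracy,
    }


# Toy example: three predictions, two of which have ground truth so far.
preds = pd.DataFrame({"request_id": [1, 2, 3],
                      "predicted_label": ["churn", "stay", "churn"]})
labels = pd.DataFrame({"request_id": [1, 2],
                       "true_label": ["churn", "churn"]})
print(evaluate_when_labels_arrive(preds, labels))
```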

Catching Silent Failures

Perhaps the most unsettling aspect of machine learning systems is their ability to fail quietly. No crashes, no error logs. Just a slow drift away from useful outputs. Even with perfect code and stable infrastructure, models can degrade silently as their input data evolves.

“You can have perfect code and stable infrastructure, but if the data shifts, the model silently drifts away from reality. That’s what makes MLOps so important: it’s about building systems that can detect those silent failures before they reach your users.”

This is the essence of MLOps maturity: building observability and feedback loops that connect models to the world they interpret. Machine learning doesn’t break loudly; it drifts quietly. The job of MLOps is to catch that drift before it turns into damage.
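One concrete building block of that feedback loop is making sure every prediction is logged with enough context to be checked later, both against incoming data distributions and against eventual ground truth. A minimal, framework-agnostic sketch follows; the model interface, field names, and JSONL storage are assumptions, and in practice these records would land in a warehouse or feature store rather than a local file.

```python
import json
import uuid
from datetime import datetime, timezone


def predict_and_log(model, features: dict, log_file: str = "predictions.jsonl"):
    """Wrap a model call so every prediction can later be joined with ground truth."""
    request_id = str(uuid.uuid4())
    prediction = model.predict(features)  # assumed model interface
    record = {
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "features": features,      # inputs, for drift analysis later
        "prediction": prediction,  # outputs, for quality analysis later
    }
    with open(log_file, "a") as f:
        f.write(json.dumps(record) + "\n")
    return request_id, prediction
```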

Quality Incidents: A New Frontier?

SREs have traditionally kept systems running by watching for clear failure signals: latency spikes, error rates, saturation. In ML applications, the signals are less deterministic and often lag behind the impact. Incidents in the AI era aren’t about service outages; they’re about a slow erosion of the model’s output quality.

Treating data drift as an incident domain reframes the challenge: instead of monitoring only infrastructure, we monitor the relationship between a system’s outputs and the “good” outputs it is meant to produce. Reliability, in this new world, isn’t just about uptime anymore. And MLOps, when done right, is how we keep our systems aligned with the outputs they’re designed to provide.