In this episode, Julien Simon brings a grounded perspective to the chaos of modern GenAI systems. Despite the hype, he argues that reliability fundamentals haven’t changed: hosted LLMs are still cloud APIs, subject to the same failure modes SREs know well. We dig into why identical models behave differently across providers, the realities of fast-moving open-source stacks, the rising role of ML Ops, and how enterprises navigate the open-source vs. closed-source debate.
Key Topics Discussed
Reliability Challenges in Hosted LLM APIs
- SLAs, uptime, latency, throughput, throttling
- Treating LLMs as “just another cloud API”
- Security risks when sending sensitive data to external providers
Why the Same Model Behaves Differently Everywhere
- Different inference servers (vLLM, SGLang, llama.cpp)
- Batching strategies and provider-level optimizations
- Quantization, compilation, custom CUDA/PyTorch stacks
The Fluid and Fragmented LLM Stack
- Lack of standardization in open-source tooling
- Why “chasing the latest model” is a trap
- Julien’s wave philosophy for stabilizing production systems
Open Source vs. Closed Source Models
- Why enterprises increasingly lean toward open-weight models
- Performance, privacy, cost, and deep domain alignment
- The role of community fine-tuning (2M+ models on Hugging Face)
The Rise of GenAI Ops
- AI apps as distributed systems: RAG, agents, guardrails, caches
- New operational skills emerging around LLM-based architectures
- What ML Ops really means in practice
Building Reliable AI Systems at Scale
- The danger of sexy but unstable tooling
- How to think about multi-model architectures
- Why reliability still has to win over novelty
What reliability challenges come with using hosted LLM APIs?
It’s never different this time. And you’re going to hear me say that a lot. If you’re working with hosted APIs like OpenAI, Anthropic, or Amazon Bedrock, at the end of the day you’re still working with a cloud-based API. All the usual concerns apply, and more.
You need to worry about uptime. Maybe you get an SLA, maybe you don’t. You need to worry about latency, throughput, parallel queries, and throttling (soft and hard limits), whether the published limits match what you actually see, and whether you’ve tested them yourself. Load testing, all that good stuff.
And you need to add security to the picture because you’re going to send these APIs all your good data: confidential data, customer data, PII. Treat these as vanilla APIs. Don’t let the AI magic blind you: it’s an API running in the cloud, and it can break and fail and slow down in every possible way. Start there. Test everything. And have a backup strategy if the API fails.
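For illustration, here is a minimal sketch of that “vanilla cloud API” mindset in code: timeouts, bounded retries with backoff, and a fallback provider. The endpoint URLs and payload shape are hypothetical placeholders, not any specific vendor’s API.

```python
# Treat a hosted LLM like any other cloud API: timeouts, bounded retries
# with exponential backoff, and a backup provider when the primary fails.
# Endpoints and payload shape are illustrative placeholders.
import time
import requests

PRIMARY = "https://primary-provider.example/v1/chat"    # hypothetical
FALLBACK = "https://fallback-provider.example/v1/chat"  # hypothetical

def call_llm(url: str, prompt: str, timeout: float = 10.0) -> str:
    resp = requests.post(url, json={"prompt": prompt}, timeout=timeout)
    resp.raise_for_status()
    return resp.json()["text"]

def generate(prompt: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            return call_llm(PRIMARY, prompt)
        except requests.RequestException:
            time.sleep(2 ** attempt)  # back off before retrying
    # Primary exhausted: fail over to the backup provider instead of erroring out.
    return call_llm(FALLBACK, prompt)
```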
Why does the same model behave differently on different providers?
It’s a real thing. You’d imagine the same Llama model should behave the same everywhere, but that’s not how it works. Even if two providers start with the same model artifact from Hugging Face, the deployment parameters will be different.
Somebody will use vLLM, somebody else SGLang, somebody else llama.cpp, all of which have a zillion parameters that influence text generation. Batching strategies matter a lot too, because providers want throughput.
And then they all modify things. They run at scale, so they use custom versions of these servers with extra tweaks to squeeze performance. They also quantize or compile the model, and every quantization algorithm has different parameters. Add different PyTorch versions, CUDA drivers, and so on, and it’s a complicated stack.
Everybody has good intentions: better performance, better throughput, better price. But things end up being different. So if you think you can switch providers and get the same output because “it’s the same model,” you’ll be disappointed. As always: test, test, test.
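One way to “test, test, test” is to send the same prompt with pinned sampling parameters to each provider and diff the outputs. A minimal sketch, assuming both providers expose an OpenAI-compatible endpoint; the URLs, keys, and model id are placeholders:

```python
# Send an identical, pinned request to two providers serving the "same" model
# and compare the answers. Differences then come from the serving stack
# (inference server, batching, quantization), not from sampling randomness.
from openai import OpenAI

providers = {
    "provider_a": OpenAI(base_url="https://provider-a.example/v1", api_key="KEY_A"),
    "provider_b": OpenAI(base_url="https://provider-b.example/v1", api_key="KEY_B"),
}

prompt = "Summarize our refund policy in one sentence."
for name, client in providers.items():
    out = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # pin sampling parameters across providers
        max_tokens=128,
    )
    print(name, "->", out.choices[0].message.content)
```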
Are LLMs just nondeterministic by nature?
They’re trained to be creative. If that’s not what you want, maybe you don’t need to use GenAI. If you want deterministic, stable answers, use older transformer models where you have more control. You can’t blame modern LLMs for being variable — that’s how we train them.
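On the “more control” point, variability from sampling can be dialed out at inference time with greedy decoding. A minimal sketch using the transformers library; the model name is a placeholder, and note that stack differences can still shift results even with greedy decoding:

```python
# Greedy decoding: do_sample=False makes generation deterministic for a given
# input on a given stack (different kernels or quantization can still differ).
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "your-org/your-small-model"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("Classify this ticket as billing or technical:", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```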
Will LLM tooling ever standardize like containers did for traditional software?
Honestly, I don’t know. These are very complex stacks, more complex and more fluid than what we had before. Yes, there are specific ML engineering skills, but at the end of the day, it’s still DevOps: building the runtime environment that works best for your company.
Right now the stack is extremely fast-moving. New model architectures, new engineering tricks. They all land quickly in tools, and you get tiny and not-so-tiny differences everywhere. Will we ever get an AI equivalent of the LAMP stack, or the “Kubernetes + Helm + Prometheus” stack? I don’t know.
And here’s something I keep repeating: if you use the same model, same stack, same GPUs as everyone else, where’s your competitive advantage? We’re far from commoditization. What’s the S3 equivalent for AI? I don’t know. Maybe we find out in 10 years. But right now, it’s going to stay fluid.
Should reliability teams limit complexity, or embrace the chaos?
If you’re a startup looking for product–market fit, break everything every week if you want: you don’t have customers yet. But once you get closer to production, reliability matters. Production systems need to be robust and long-lived. Nobody deploys something intended to live for a week.
Experimentation is fine early, but heading into production, you need to ask: Can I live with this thing for six months? Two years? Because a lot of companies do plan deployments that should run 24/7 for years. That’s just how they operate.
If you bet on sexy open-source projects that become messy, unmaintained, or chaotic later, that’s not the right bet for production.
One way I like to work is a wave philosophy. Each project is a wave that moves for about six months toward production. You don’t change your stack two weeks before launch. Maybe six months later you get another shot; that’s the next wave. It’s iterative. Don’t pause everything because a new model came out yesterday. Ship what you have, get feedback, and fold it into the next release.
How do you think about open-source vs. closed-source LLMs for enterprises?
I worked at Hugging Face for three years, and Arcee also shares models openly under Apache 2.0 licenses. So you see where I stand.
ChatGPT was amazing. It educated everyone about what GenAI and chatbots could do. But once the excitement passes and you try to move to production, you realize the performance isn’t always amazing. Generation speed matters a lot, especially if you generate code. Thirty seconds versus five minutes is a big deal.
Privacy and security really matter outside the US. If I go to Singapore or the UAE, people question why they should send confidential data to another country. Reverse the situation and it makes sense.
And large models are a mile wide and an inch deep. Enterprises often need the opposite: very narrow but very deep domain knowledge. They need a model that knows their domain, not astronomy or cooking or poetry. Hosted models don’t know your private data, which is a good thing. But it also means they lack depth.
So performance, privacy, domain knowledge, and cost drive people toward open-weight models. You understand what you’re working with. You can self-host if needed. You can train or fine-tune. And you have more control.
What are “small” or open-weight models, and why do they matter?
Open-weight models are what you find on Hugging Face: over two million models now. They’re trained by the open-source community and big tech companies. Nvidia, Meta, Microsoft. They all build and release open-weight models.
It’s still expensive to train these models, but the value comes from the community. Hundreds of thousands of people take these base models and fine-tune them, specialize them, quantize them, optimize them. That’s how you get two million models.
So you might take an Arcee model, then fine-tune it on a legal dataset or your own internal documents. Now you have your model. You can share it or keep it. And because so much of this work is public, customers often find models off the shelf that already solve their problem.
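As a rough illustration of that fine-tuning workflow, here is a minimal sketch assuming the transformers and peft libraries; the base model id and the training data are placeholders for whatever open-weight model and domain dataset you pick:

```python
# Specialize an open-weight model with LoRA: attach small trainable adapters
# instead of updating all of the base model's weights.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "your-org/your-open-weight-model"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# From here, train with transformers.Trainer (or an SFT trainer) on your
# domain data, e.g. legal contracts or internal documentation, then share
# the result or keep it private.
```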
Why do companies use open-source models but still rely on hosted APIs?
It comes down to whether you want to run your own infrastructure. If you run at scale, maybe the cost of on-prem servers or cloud GPU instances makes sense. Maybe you already have the team.
But many customers don’t want that pain. They’d rather call an API. They just want the model behind the API to be something they understand, something less of a black box, something cheaper.
Hosted Llama, hosted Mistral, hosted Qwen: they’re smaller and usually cheaper than top-of-the-line closed models. And you can experiment in-house with the same model you later deploy via API. That’s a nice workflow.
Are we seeing new roles emerge for GenAI operations?
Yes, some specific skills are emerging. A model is just an endpoint, sure, but an AI app is much more than a model. If you’re building agents, you have several models chatting. If you’re doing RAG, you have vector databases or equivalent systems feeding retrieval into generation. You might have guardrail models. Caching, prompt rewriting, prompt libraries — some people use all of that.
All of this is a distributed system, and keeping it fast, reliable, and observable is a lot of work. That’s where the new skills are. My definition of ML Ops is understanding and operating the workflow end to end, not just hosting the model.
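As one concrete slice of that distributed system, here is a minimal retrieval sketch, assuming sentence-transformers for embeddings; the documents and the embedding model are illustrative, and the generation call is left to whatever endpoint you operate:

```python
# Minimal RAG retrieval step: embed documents, find the most similar ones to
# a query, and build a grounded prompt. Each piece (embedding model, vector
# store, generation endpoint) is a separate component to monitor in production.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Our SLA guarantees 99.9% monthly uptime for the inference API.",
    "Throttling kicks in above 100 requests per second per API key.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # one small embedding model
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

question = "What uptime does the API promise?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# Send `prompt` to the model endpoint of your choice; add guardrails,
# caching, and observability around each hop.
```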
