Mariano Fernández Cocirio, Staff Product Manager at Vercel, explains why serverless architectures are hitting unexpected limits when dealing with AI workloads.
The Limitations of Traditional Serverless for AI Workloads
Mariano explains that the main problem with serverless architectures as we know them is that workloads have evolved. Traditionally, you would use serverless functions to run quick database queries or a backend service, and everything in serverless architecture has been optimized around shaving off milliseconds, with that quick-response use case in mind.
Nowadays, however, more and more applications use serverless functions to call AI-backed APIs. When an AI workload sits behind a serverless function, the function spends a long time waiting for the response, and you keep paying for the machine you have allocated even though the CPU is doing no work. The same applies while a response is being streamed back.
This dynamic is driving up the cost of processing AI workloads on serverless, pushing engineers to look for alternatives, including traditional servers, even if they have never worked with one before.
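To make the problem concrete, here is a minimal sketch of the pattern described above: a serverless handler that proxies a slow AI API and streams the result back. The endpoint URL and payload shape are hypothetical, not a real provider's API.

```typescript
// Minimal sketch of the pattern Mariano describes: a serverless handler that
// proxies an AI model. The endpoint URL and payload shape are hypothetical.
export async function handler(req: Request): Promise<Response> {
  // The instance is billed from here on, but the CPU sits mostly idle:
  // it is simply waiting on the model provider's network response.
  const upstream = await fetch("https://api.example-ai.com/v1/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: "Summarize this document..." }),
  });

  // Streaming the model output back keeps the connection (and the billing
  // clock) open even longer, again with almost no CPU work per byte.
  return new Response(upstream.body, {
    headers: { "Content-Type": "text/event-stream" },
  });
}
```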
Combining Servers and Serverless for AI Applications
To tackle this problem, Vercel came up with Fluid compute. Fluid reuses instances: each instance of what used to be a traditional serverless function now works as a small server that can handle multiple invocations. At the same time, you retain the full flexibility of serverless, including the ability to scale out indefinitely or down to zero.
If no requests are running, you are not paying for anything; that is the contrast with traditional servers, which you pay for around the clock. When multiple requests or invocations arrive, they can be handled within the same instance, and a new instance is spun up before performance degrades.
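As an illustration of why this works for AI traffic, the sketch below (not Vercel's implementation) shows how a single Node-style instance can absorb many concurrent, I/O-bound invocations, because each one spends almost all of its time awaiting a slow upstream call and leaves the event loop free.

```typescript
// Illustrative only: one process standing in for a Fluid-style instance.
import { setTimeout as sleep } from "node:timers/promises";

let inFlight = 0;

async function invocation(id: number): Promise<string> {
  inFlight++;
  console.log(`invocation ${id} started (in flight: ${inFlight})`);
  await sleep(2_000); // stands in for a slow AI API call
  inFlight--;
  return `result ${id}`;
}

// Ten "requests" share one instance and finish in ~2s total, not ~20s,
// because the waiting overlaps instead of occupying ten separate machines.
const results = await Promise.all(
  Array.from({ length: 10 }, (_, i) => invocation(i)),
);
console.log(results.length, "invocations served by one instance");
```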
The Secret Sauce: An Advanced Load Balancer
According to Mariano, the key to packing multiple invocations into one instance without degrading performance lies in the load balancer that ships with Fluid.
You want to keep the same performance and reliability you had before in the serverless world, even though serverless's one-to-one relationship between invocations and instances can sometimes be overkill.
With Fluid, the shift from one-to-one to one-to-many opens new possibilities, but Vercel must ensure that multiple invocations are packed into an instance efficiently.
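The sketch below is a toy version of that one-to-many routing idea, not Vercel's actual balancer: it fills existing instances up to an assumed per-instance concurrency limit and only scales out once every instance is saturated.

```typescript
// Toy load balancer: track in-flight invocations per instance and prefer
// filling existing instances before creating new ones.
interface Instance {
  id: number;
  inFlight: number;
}

const MAX_CONCURRENCY = 8; // assumed per-instance limit, purely illustrative
const instances: Instance[] = [];

function route(): Instance {
  // Prefer the least-loaded instance that still has headroom.
  const candidate = instances
    .filter((i) => i.inFlight < MAX_CONCURRENCY)
    .sort((a, b) => a.inFlight - b.inFlight)[0];

  if (candidate) {
    candidate.inFlight++;
    return candidate;
  }

  // Every instance is saturated: scale out before performance degrades.
  const fresh: Instance = { id: instances.length, inFlight: 1 };
  instances.push(fresh);
  return fresh;
}

function complete(instance: Instance): void {
  instance.inFlight--;
  // Idle instances could be retired here, scaling back down to zero.
}

// Example: 20 simultaneous invocations fill instance 0, then 1, then 2.
const assignments = Array.from({ length: 20 }, () => route().id);
console.log(assignments); // eight 0s, eight 1s, four 2s with MAX_CONCURRENCY = 8
```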
New Approaches to Reliability: Global Dense Compute
With Fluid, Vercel is introducing several new ways to improve service reliability. One of them is the concept of Global Dense Compute, a practice used by large companies like Google, Meta, and Microsoft.
Instead of pushing compute to the network's edge, closer to where the request originates, the request is processed closer to where the data resides. That means fewer regions but smarter request routing, ensuring that round trips to the data don't add unnecessary latency and that zone failovers become less of a risk.
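As a simplified illustration of the idea (the region names and the data-placement map are assumptions, not Vercel's configuration), a router in this model picks the region where the data lives rather than the edge location closest to the caller:

```typescript
// Hypothetical data-placement map: which region holds each tenant's database.
const DATA_REGIONS: Record<string, string> = {
  "tenant-eu": "fra1", // data lives in Frankfurt
  "tenant-us": "iad1", // data lives in Washington, D.C.
};

function pickRegion(tenantId: string, requestOrigin: string): string {
  // Rather than running at the edge closest to requestOrigin, run in the
  // (possibly farther) region that holds the data, so the function makes
  // fast local queries instead of repeated cross-region round trips.
  return DATA_REGIONS[tenantId] ?? requestOrigin;
}

console.log(pickRegion("tenant-eu", "sfo1")); // "fra1", despite the US caller
```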
Other strategies to improve Fluid’s reliability include pre-warmed functions and bytecode caching by default for all customers.
Developers: Back to (Global) State Affairs
Mariano is optimistic about Fluid’s impact on developers. You can focus on developing features and delivering business value instead of juggling different types of infrastructure or figuring out how to run applications more efficiently.
However, Fluid's shared-instance nature also means departing from the clean-state model that serverless previously provided: developers must now be mindful that any invocation can mutate the global state of the server instance.
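The sketch below shows the kind of pitfall this introduces, using a hypothetical handler: module-level state is shared by every invocation that lands on the same instance, so concurrent requests can step on each other.

```typescript
// Module-level state lives as long as the instance, not the request.
let lastUserId: string | null = null;

export async function handler(req: Request): Promise<Response> {
  // Unsafe: a concurrent invocation can overwrite this while we are awaiting
  // below, leaking one user's context into another's response.
  lastUserId = req.headers.get("x-user-id");

  await callModel(); // stands in for a slow AI call; other invocations run meanwhile

  // Safer: keep per-request data in local variables instead of module scope.
  return new Response(`hello ${lastUserId}`);
}

async function callModel(): Promise<void> {
  // placeholder for an upstream request
}
```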