
Why Reliability in the AI Era Starts with the Network with Marino Wijay

🔌
Started with routers & switches
🧠
Loves deep tech reasoning
🎥
Streams weekly on YouTube
🍁
Lives in Ontario, Canada

Listen on Spotify and Apple Podcasts!


Marino Wijay, Staff Solutions Architect at Kong, is a cloud networking and API expert with deep experience in Kubernetes, service meshes, and software-defined infrastructure.

Networking as the Bedrock of Reliability

In Marino’s opinion, reliability is largely about dealing with the network layer. His career began with plugging in devices, routers and switches, which taught him that if “all of these systems need to operate together in unison,” then the network must be reliable.

To Marino, the network fabric is like the roads that connect cities: “If you don't have the road to travel on, how do you get around?” Whether it's an overloaded router CPU or a switch overheating in a poorly ventilated closet, these are common failures that directly impact reliability.

He drives home the point using a Toronto traffic analogy: "Imagine for a second that your DVP route is congested. There might be an accident there…  now you have to think about an alternate path." That ability to reroute and maintain service is the essence of building resilient networks. It’s not just about performance; it’s about ensuring there is always a way forward. This framing sets the foundation for how Marino sees all other layers of reliability — starting with a robust, adaptable network fabric.
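The rerouting idea can be sketched in a few lines. This is an illustrative toy, not anything from the episode: the route names, health fields, and 90% load threshold are all assumptions made up for the example.

```python
def pick_route(routes):
    """Return the first viable route, ordered by preference."""
    for route in routes:
        # A route is viable if it is up and not congested (threshold assumed).
        if route["healthy"] and route["load"] < 0.9:
            return route["name"]
    raise RuntimeError("no viable path: destination is unreachable")

routes = [
    {"name": "primary", "healthy": False, "load": 0.2},   # e.g. the blocked DVP
    {"name": "alternate", "healthy": True, "load": 0.5},  # the detour
]
print(pick_route(routes))  # -> "alternate"
```

The point of the sketch is the ordering: the system always prefers the primary path but degrades gracefully to an alternate one, so there is "always a way forward."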

Virtualization and Software-Defined Networking

Servers were the first to undergo the virtualization transformation. Marino recalls, “Virtualization was helping us take a server, and pack more workloads, while still ensuring that we were maximizing our usage and our efficiency.” Then came the idea: “If we can do this for compute and storage, why not do it for the network as well?” That shift led to a new way of thinking: one where networks could be built on demand, abstracted away from the underlying physical hardware.

Marino explains how companies like Nicira (founded in 2007, acquired in 2012) pioneered this idea with technologies like NSX. This meant that two data centers could operate as one, in a logical sense. Developers could make systems behave as if they were on the same network, solving real-world problems like hardcoded IPs in legacy code. This evolution not only brought scale and flexibility but enabled deeper forms of network reliability through abstraction.

Service Meshes and Reliability-Enabling Abstractions

As Kubernetes gained momentum, Marino says, “It wants you to think that there is a network there… I’ll figure out everything else.” But while CNIs provided container-to-container communication, “the real magic lived further up in that OSI stack.” That’s where service meshes came in. Tools like Istio, Kuma, and Linkerd introduced “another layer of networking, pulling in different ideas of networking to create a distributed system of sorts.” They weren’t just about connectivity, they were about solving challenges like disaster recovery, availability, and multi-cluster communication.

Service meshes enabled deeper metrics and observability, granular traffic control, and even internal deployment patterns, like blue-green and canary releases that run entirely inside your network without touching the outside world. Marino notes that some companies now use service meshes as the backbone for active-active setups to achieve higher levels of availability. It’s all about “a distributed control plane that says, I want to maintain desired state, I want my services to always have a pathway to communicate.” The story here is clear: abstraction enables reliability at scale, not at the cost of visibility, but because of it.
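The canary pattern Marino mentions boils down to a weighted traffic split. Here is a minimal sketch of that idea, with an assumed 10% canary weight; in a real mesh (Istio, Kuma, Linkerd) this split is configured declaratively in the mesh's routing rules, not written by hand.

```python
import random
from collections import Counter

def choose_version(canary_weight=0.1):
    """Send roughly 10% of internal traffic to the canary, the rest to stable."""
    return "canary" if random.random() < canary_weight else "stable"

# Simulate 10,000 internal requests and count the split.
counts = Counter(choose_version() for _ in range(10_000))
print(counts)
```

Because the split happens inside the mesh, the canary can be observed, compared, and rolled back without any external user ever being pointed at it deliberately.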

LLMs’ Impact on Reliability and Infrastructure

Marino argues that to get a much more accurate, much more specific answer, we should let LLMs take their time. He explains how reliability considerations now include not just uptime, but thoughtful responsiveness. "Networking can slow things down and speed things up," especially when it comes to routing requests between different AI models. That’s why techniques like caching and intelligent proxying matter. “Why not cache? Why not store this in a cache somewhere, and just return a very similar response?”
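The caching idea can be sketched with a simple prompt-keyed store. This is a deliberately simplified assumption: real LLM caches typically match on embedding similarity, while this toy only normalizes whitespace and case before hashing, so only trivially similar prompts hit.

```python
import hashlib

class PromptCache:
    """Toy LLM response cache keyed by a normalized prompt (illustrative only)."""

    def __init__(self):
        self._store = {}

    def _key(self, prompt: str) -> str:
        # Collapse whitespace and casing so near-identical prompts share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, prompt):
        return self._store.get(self._key(prompt))

    def put(self, prompt, response):
        self._store[self._key(prompt)] = response

cache = PromptCache()
cache.put("What is a service mesh?", "A dedicated infrastructure layer for service-to-service traffic.")
print(cache.get("what is a   SERVICE mesh?"))  # hits despite spacing/case differences
```

A cache hit skips the model entirely, which is the point Marino is making: the fastest and cheapest inference is the one you never run.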

AI proxies can make routing decisions based on availability: if a local model is down, reroute to the cloud. But these systems must also sanitize requests before sending them upstream, protecting against leakage of PII or other sensitive data. Marino points out that “models aren’t able to discern that, but a proxy will be able to.” These network-layer decisions, familiar from the days of API gateways and service meshes, are now critical components of reliable AI delivery. Reliability, in this context, is about accuracy, safety, cost, and context, not just uptime.
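Both proxy behaviors, failover routing and request sanitization, fit in one small sketch. Everything here is assumed for illustration: the backend names, the `local_up` flag, and the sanitizer, which redacts only email addresses (a real proxy would cover many more PII patterns).

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sanitize(prompt: str) -> str:
    """Redact obvious PII (here, just emails) before the request leaves."""
    return EMAIL.sub("[REDACTED]", prompt)

def route(prompt: str, local_up: bool) -> tuple[str, str]:
    """Prefer the local model; fall back to the cloud with a sanitized prompt."""
    clean = sanitize(prompt)
    backend = "local-model" if local_up else "cloud-model"
    return backend, clean

backend, clean = route("Summarize the ticket from jane@example.com", local_up=False)
print(backend, "|", clean)  # cloud-model | Summarize the ticket from [REDACTED]
```

The key design point is that sanitization happens at the proxy, before the routing decision is acted on, so the cloud fallback never widens the blast radius of a data leak.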