Brian Shaw is an experienced engineering leader helping Uphold navigate the complex intersection of crypto and traditional finance. In this episode, he shares how his team builds resilient infrastructure to handle unpredictable market spikes, the lessons learned from major incidents, and how AI is starting to reshape reliability engineering.
1. The Unique Reliability Demands of a Crypto-Fintech Platform
“We're in an interesting place,” says Brian Shaw, reflecting on Uphold's unique position at the intersection of crypto and traditional finance. “We're subject to, to a certain extent, multiple markets,” he explains. Market fluctuations, whether from crypto or traditional finance, impact their platform significantly.
“Depending on what's happening specifically in the crypto market can affect what's happening with our user base, as well as the traditional finance world.” These fluctuations are more than background noise; they create major reliability challenges. “We can be operating pretty normal one minute… and just a certain piece of news or a certain thing that happens in the world can just trigger everything to just spike up like crazy.” In this environment, resilience isn’t optional: it is core to Uphold’s ability to serve its customers.
2. Microservices, Kubernetes, and the Role of Cloud-Native Tools
Uphold’s ability to adapt to sudden traffic spikes depends heavily on modern, cloud-native architecture. “It kind of goes back to fundamentals,” Brian explains. “Making sure that our microservices architecture is truly a microservices architecture, that services really are decoupled.” This design allows individual components to scale or fail gracefully without taking down the entire system.
The backbone of this is Kubernetes. “We host most of our infrastructure in AWS. We do run all of our critical systems on Kubernetes.” One essential tool is Karpenter, an open-source Kubernetes node autoscaler originally built by AWS. “It helps us keep the nodes rolling, keep the nodes updating, and it really helps us scale.”
During one event, Uphold saw traffic surge from “about 200,000 requests per minute” to roughly 600,000 to 700,000 within seconds. Tools like Karpenter ensure the infrastructure can grow and shrink dynamically to meet these extreme demands.
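To give a sense of what a jump of that size means for capacity, here is a rough back-of-the-envelope sketch. The per-replica throughput (5,000 requests per minute), the 20% headroom, and the 10 replicas per node are hypothetical numbers chosen purely for illustration; they do not come from the episode and are not Uphold's real figures. Only the 200,000 and 700,000 requests-per-minute values are taken from the quote above.

```python
def ceil_div(numerator: int, denominator: int) -> int:
    """Integer division that rounds up instead of down."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return -(-numerator // denominator)


def replicas_needed(requests_per_minute: int,
                    per_replica_rpm: int,
                    headroom_pct: int = 20) -> int:
    """Service replicas needed to absorb the load plus a safety margin.

    per_replica_rpm and headroom_pct are illustrative assumptions.
    Integer arithmetic is used so the rounding is exact.
    """
    padded_load = requests_per_minute * (100 + headroom_pct)
    return ceil_div(padded_load, per_replica_rpm * 100)


def nodes_needed(replicas: int, replicas_per_node: int) -> int:
    """Worker nodes needed to host the given number of replicas."""
    return ceil_div(replicas, replicas_per_node)


if __name__ == "__main__":
    for label, rpm in (("baseline", 200_000), ("spike", 700_000)):
        replicas = replicas_needed(rpm, per_replica_rpm=5_000)
        nodes = nodes_needed(replicas, replicas_per_node=10)
        print(f"{label}: {rpm} rpm -> {replicas} replicas on {nodes} nodes")
```

Under these assumed numbers, the fleet has to more than triple within seconds, which is why automated node provisioning matters more here than manual capacity planning.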
3. The Birth of the Platform Reliability Team
One of Uphold's biggest reliability lessons came from an unexpected market event. Brian recalls, “The SEC in the US made a certain ruling on crypto assets and the markets reacted very positively to it.” As one of the few platforms carrying that particular asset, “the entire markets came to us to start trading.” But the surge overwhelmed their systems. “Basically, we lost out on an entire day of trading that asset because our systems couldn't handle the load.”
The aftermath wasn’t just technical: it reshaped their organizational structure. “That resulted in the creation of a formal platform reliability team within Uphold,” Brian explains. The team focused first on addressing the weaknesses exposed by that incident. Over time, they adopted practices that align most closely with Google's SRE principles. But Uphold takes a pragmatic approach, focusing on “metrics that mean something to us.”
4. Security and Transparency in the Finance-Crypto Hybrid Space
Operating at the intersection of finance and crypto brings unique security challenges, and Uphold embraces transparency as a core value.
“We're fully transparent with our reserves,” Brian emphasizes. “At any given time, anybody in the world can go to our website and see exactly the value of the assets that we're holding.” This approach builds trust but also makes them an attractive target. “Obviously makes us an interesting target, right? Because we're telling you what we have to get at.”
Uphold operates as a regulated entity, subject to PCI standards for card issuance, and adheres to ISO 27001 and SOC 2. But security goes beyond frameworks. “We're always conscious of how do we make sure that things are secure from the inside as well as from the outside, making sure the right people have the right access, protecting secrets, not logging into systems.” For Uphold, security is constant vigilance.
5. The Growing Impact of AI on Operations and Reliability
AI is reshaping software development, and operations teams like Brian’s are watching closely. “One of the things that I'm finding the most interesting is probably some of the AI tools and trying to use those to our benefit,” he shares. AI helps uncover hidden issues by analyzing code and infrastructure at scale. “You can point something at your infrastructure's code repo or your application repo and just say, find me typos, find me security concerns.”
But AI's influence doesn’t stop at development. It’s increasingly being applied to operations. “Trying to take a look at that from the angle of our platform reliability, maybe analyze what's happening in our environment, taking a look at the logs, taking a look at the metrics, trying to signal us the things that we might not notice otherwise.” While it's early days, Brian is optimistic AI will “help probably reduce incidents over time.”
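The idea of surfacing signals "we might not notice otherwise" can be illustrated with a much simpler, non-AI baseline: flagging metric samples that deviate sharply from their recent history. The sketch below is a plain rolling z-score check, not the tooling Brian describes. The function name `find_spikes`, the window size, the threshold, and the sample values are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def find_spikes(samples, window=5, threshold=3.0):
    """Return indices of samples far above the recent rolling average.

    A sample is flagged when it sits more than `threshold` standard
    deviations above the mean of the previous `window` samples.
    """
    history = deque(maxlen=window)
    spikes = []
    for index, value in enumerate(samples):
        # Only judge a sample once a full window of history exists.
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # A perfectly flat history has no spread to compare against.
            if sigma > 0 and (value - mu) / sigma > threshold:
                spikes.append(index)
        history.append(value)
    return spikes


if __name__ == "__main__":
    # Hypothetical requests-per-minute readings, in thousands.
    readings = [200, 202, 198, 201, 199, 650, 660]
    print(find_spikes(readings))
```

In this made-up series only the first jump is flagged: once the spike enters the window it inflates the baseline, so the following sample no longer looks unusual. That limitation is one reason teams look to richer analysis of logs and metrics rather than fixed thresholds alone.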