
September 23, 2025

15 mins

2025’s Top 50 People Making the World More Reliable

The Reliability Top 50 honors those who keep our ambitious systems running, translating SLOs into uptime, transforming postmortems into industry standards, and teaching us all how to fail more gracefully.

Written by JJ Tang

When a new frontier AI model drops or an EV company pushes an autonomy update, the headlines focus on breakthroughs and valuations. But the real test is quieter: does it work when millions log in at once, when GPUs fail mid-inference, when a global outage ripples across the internet? The magic only lasts because reliability leaders make sure it does.

Welcome to the Reliability Top 50: the SREs, infrastructure leaders, and incident commanders who keep our most ambitious systems resilient. They span AI companies like Anthropic and Mistral, hardware giants like NVIDIA and Cerebras, cloud platforms like Google and Microsoft, and enterprise leaders like Okta, Twilio, and Salesforce. Some build the pipelines that make inference possible; others lead response teams when things inevitably go wrong. Together, they make sure scale never comes at the expense of stability.

What This Year’s List Shows

  • Reliability is cross-disciplinary: this year’s honorees include not just SREs but also incident commanders, quality leaders, and even customer support directors, reflecting how resilience now cuts across the entire org chart.
  • AI is mainstreaming reliability: frontier labs, voice AI startups, and GPU providers all treat SRE as a first-class discipline, not an afterthought to deal with after research is “done.”
  • Incidents define culture: from Twilio to Tesla to theScore, leaders here run the playbooks that keep global services steady under stress, and write the lessons the rest of the industry will copy.
  • Hardware meets software: wafer-scale chips, DGX clusters, and GPU clouds demand reliability practices once reserved for telcos and hyperscalers.

Why This Matters

Modern tech is a tower of dependencies: AI models on orchestration frameworks on GPU clouds on global networks. Remove one link and the whole illusion crumbles. The Reliability Top 50 honors those who keep the tower upright, translating SLOs into uptime, transforming postmortems into industry standards, and teaching us all how to fail more gracefully.

These are the people who make “always-on” possible. This is their year.

50 People Making the World More Reliable

Manish Thakrani (Abnormal Security)

After more than five years scaling identity and access systems at Twilio, Manish Thakrani now leads Cloud Infra and SRE at Abnormal Security, an AI-powered email security platform protecting enterprises and government organizations against advanced threats. His teams focus on observability, incident response, and hitting four nines of availability across commercial and federal deployments.

Joseph LaCava (Anduril Industries)

Anduril Industries develops cutting-edge defense technology, from autonomous systems to AI-powered defense platforms. Reliability is central to that mission, ensuring these systems operate under the harshest conditions. As SRE Manager, Joseph LaCava leads efforts to keep Anduril’s platforms resilient at scale. He previously spent nearly three years at Devo, where he built and led teams delivering cloud-native observability and analytics for enterprise customers.

Todd Underwood (Anthropic)

When you’re running AI at frontier scale, reliability can’t be an afterthought. Todd Underwood leads Reliability at Anthropic, making sure some of the world’s most advanced foundation models run safely and smoothly. Before this, he spent more than a decade at Google and a year at OpenAI, helping teams scale the massive systems behind global products and cutting-edge research.

Alex Palcuie (Anthropic)

Taking AI from powerful to reliable is one of Anthropic’s central challenges. Alex Palcuie is part of the AI Reliability Engineering team, where he focuses on making Claude more resilient, starting with inference. He previously spent more than eight years at Google, managing large-scale SRE teams responsible for GPU provisioning, observability, and infrastructure control across Google Cloud Platform.

Sujay Jayakar (Anysphere)

Anysphere is rethinking developer tooling with Cursor, its AI-native code editor built to scale with ambitious engineering teams. Sujay Jayakar leads Infrastructure Engineering, bringing with him deep experience from co-founding Convex and earlier roles at Microsoft Research and Dropbox, where he spent eight years as Principal Engineer.

Tammy Butow (Apple)

After leading SRE teams at Dropbox and Gremlin, Tammy Butow joined Apple in 2023, where she now works in Traffic Engineering on load balancing and networking at massive scale. Her career has spanned startups and enterprise, with a track record of making critical systems dependable for both customers and developers.

Dan O’Boyle (Bolt.new)

Dan O’Boyle is a Senior DevOps Engineer at Bolt.new, where he focuses on scaling reliability and infrastructure for fast-growing developer platforms. Before Bolt, he spent nearly three years at Reddit as a Staff Site Reliability Engineer, supporting both ML platform and security teams, and has built his career around solving reliability challenges in complex distributed environments.

Georgios Sarakakis (Cerebras)

Cerebras Systems is pushing the limits of compute with wafer-scale AI processors, where reliability is mission-critical. Georgios Sarakakis is Vice President of Quality & Reliability, building the systems and processes that keep next-generation AI hardware dependable under extreme performance demands. He previously held senior leadership roles at Rivian, leading reliability and quality engineering, and at Apple, where he directed reliability operations for iPhone and core technologies.

Saurabh Baji (Cohere)

Saurabh Baji is CTO at Cohere, where he leads engineering, product, and ML to deliver generative AI that performs reliably at enterprise scale. Since stepping up from SVP of Engineering, he’s focused on advancing Cohere’s generative and embedding models while hardening the platform for large customers; previously, he spent nearly three years at Unity leading AI & Data, building ML systems at global scale.

Dan Slimmon (D2E)

After nearly five years as a Staff SRE at HashiCorp, Dan Slimmon founded D2E, a hands-on reliability consultancy that helps teams tighten incident practice, reduce toil, and ship resilient systems. He’s since paired deep SRE experience with pragmatic coaching, most recently also doing infrastructure work at Clerk.com, to help organizations turn reliability into a habit, not just a goal.

Andrew Fong (Databricks)

When data and AI platforms grow fast, reliability has to scale with them. Andrew Fong is Senior Director of Engineering at Databricks, focusing on service foundations and large-scale infrastructure. He joined via the acquisition of Prodvana, where he was co-founder and CEO; earlier, he led Dropbox’s global Infrastructure organization, including reliability and core platform teams.

Angelos Perivolaropoulos (ElevenLabs)

ElevenLabs powers AI voice for a rapidly expanding set of products and creators, where reliability has to keep pace with demand. Angelos Perivolaropoulos engineers infrastructure and product reliability to keep those systems dependable in production. Before ElevenLabs, he held reliability and platform roles at Beacon Platform and—over a two-year stretch—at Ondat, with earlier experience at Skyscanner.

Ankita Gandhi (Glean)

Glean builds enterprise search and knowledge products that have to be fast, accurate, and always on. Ankita Gandhi is a Site Reliability Engineer focused on keeping those services running smoothly at scale. She previously spent more than two years at Goldman Sachs in SRE (latterly as a Vice President), and three years before that in Apple’s Service Reliability Engineering group.

Brian Delahunty (Google Cloud)

Brian Delahunty is VP of Engineering in Google Cloud AI, leading AI Agent Engineering. He joined Google this year after nearly two years at Anthropic, where as Head of Engineering he built out the org spanning inference, infrastructure, and product, and helped launch the Claude 3 model family that took the company from zero to over $1B ARR in under 12 months.

Vishakha Sadhwani (Google Cloud)

Google Cloud supports some of the largest AI workloads in the world, where infrastructure has to be both scalable and reliable. Vishakha Sadhwani is a Senior Cloud Architect in Strategic AI, specializing in compute, storage, and networking for training and inference. She partners with engineering and customer teams to design AI-optimized infrastructure, drawing on more than three years of experience as a Cloud Architect at Google.

Brent Chapman (Great Circle Associates)

Brent Chapman is the founder of Great Circle Associates, where for more than 30 years he has helped organizations prepare for and learn from emergencies, bringing a deep background in IT infrastructure and SRE. Alongside his consulting, he’s held senior incident management roles at Google and Slack, and most recently at Atlassian, where he worked on scaling incident processes for global engineering teams.

Charity Majors (Honeycomb)

Charity Majors is co-founder and CTO of Honeycomb, the observability platform built for teams that need to understand and debug production software in real time. Since starting Honeycomb in 2016, she’s become a leading voice on observability and resilience, advocating for engineering teams to focus on context and outcomes rather than noise and vanity metrics.

Adrien Carreira (Hugging Face)

Adrien Carreira leads infrastructure at Hugging Face, where scaling systems to support the AI community’s open-source models and platforms is a core challenge. He became Head of Infrastructure in 2024 after more than two years as Infrastructure Tech Lead, guiding the company through rapid growth in usage and platform adoption.

Balaji Kannan (Informatica)

Balaji Kannan is Director of Cloud DevOps, Operations & SRE at Informatica, leading teams that drive efficiency, scalability, and observability across cloud platforms. Over more than two decades at Informatica, he’s grown through engineering leadership roles, and today focuses on continuous deployment practices and operational excellence at enterprise scale.

Dylan-Daniel Page (Lambda)

Lambda powers compute infrastructure for AI research and production. Dylan-Daniel Page manages Core Infrastructure there, helping scale GPU clusters and cloud-native systems for demanding ML workloads. He’s also an active contributor in the CNCF community, serving as Co-Chair of the Infrastructure TAG and maintaining open-source projects like Atlantis.

Quentin Brosse (LangChain)

LangChain has become a key developer framework in the AI ecosystem, where reliability matters as usage scales. Quentin Brosse is a Platform Engineer focused on scaling LangSmith. Before joining LangChain, he spent four years at Nexthink in software and platform engineering, driving AI innovation and enhancing large-scale data platforms.

Sandhya Ramu (LinkedIn)

Sandhya Ramu is Senior Director of Site Reliability Engineering at LinkedIn, where she has spent nearly a decade leading reliability for some of the internet’s most widely used professional networking services. She oversees large-scale systems that power data, infrastructure, and operations across the platform. Before stepping into this role, she directed and managed LinkedIn’s Data Services Operations, giving her deep experience in scaling mission-critical systems that millions rely on daily.

Parneet Kaur (Lyft)

Parneet Kaur is a Software Engineer at Lyft, where she’s spent more than four years building and maintaining infrastructure powering ride-sharing at scale. Earlier, she worked at Oracle as part of the technical staff, developing provisioning tools to connect Terraform to Oracle’s cloud infrastructure.

Daria Barteneva (Microsoft)

Daria Barteneva is a Principal Site Reliability Engineer on Microsoft Azure, with more than five years of experience driving reliability across one of the world’s largest cloud platforms. She is also active in the SRE community as a longtime member of the USENIX SREcon steering committee, helping shape the conversation on reliability practices across the industry.

Devon Mizelle (Mistral AI)

Mistral AI is building next-generation open-weight models where performance and reliability need to scale together. Devon Mizelle is a Senior Site Reliability Engineer ensuring those systems stay dependable in production. He joined from Humane, where he led SRE efforts for nearly three years.

Thiara Ortiz (Netflix)

Thiara Ortiz manages Cloud Gaming Site Reliability Engineering at Netflix, building systems that keep game streaming as seamless as video. Over five years at Netflix, she’s worked across CDN reliability and now leads efforts in cloud gaming, shaping how interactive experiences scale to millions of users. She’s also shared her expertise at SREcon, presenting on service quality and internet latency measurement.

Dana Lawson (Netlify)

Netlify powers modern web development for millions of developers, where reliability and speed define the experience. Dana Lawson is CTO, guiding the company’s engineering vision after serving as SVP of Engineering. Before joining, she was VP of Engineering at GitHub, bringing experience from leading teams at scale in developer platforms.

Hazel Weakly (Nivenly Foundation)

Hazel Weakly is a Fellow at the Nivenly Foundation, advancing sustainable governance for open source communities. She also spent time at Datavant as Principal Architect and Engineering Manager, driving platform engineering and organizational growth. Through Nivenly, she champions equitable futures in tech by supporting underrepresented contributors and building community-centered governance models.

Lex Neva (NVIDIA)

Lex Neva is Principal Software Engineer in Reliability and Operational Excellence at NVIDIA, working on infrastructure for large-scale AI and GPU systems. He’s also the curator of SRE Weekly, a long-running industry newsletter that highlights outages, incident analysis, and reliability practices. Before NVIDIA, he was a Staff SRE at Honeycomb, bringing hands-on production experience to observability and distributed systems.

Disney Lam (NVIDIA)

Disney Lam is Senior Director of AI Infrastructure Engineering at NVIDIA, leading efforts to scale DGX Cloud and AI compute systems. She previously held senior roles at Cruise, Facebook, and Google, bringing more than a decade of experience in production engineering and site reliability.

Terra Field (NVIDIA)

Terra Field is a Senior DGX Cloud Software Engineer at NVIDIA, focused on infrastructure automation and distributed systems for large-scale AI workloads. Before joining NVIDIA, she spent nearly three years at Honeycomb as a Staff Platform Engineer, where she drove reliability improvements across observability platforms.

Merisa Lee (Okta)

Merisa Lee is Senior Director of Defensive Cyber Operations at Okta, overseeing security operations to protect identity and access management at enterprise scale. She joined from Cisco Meraki, where she led threat response and security engineering across the network platform business, managing teams of engineers and SOC analysts to deliver resilient defenses.

Jason Dixon (Oracle)

Jason Dixon is Senior Manager of AI Infrastructure Observability at Oracle, where he leads teams building custom observability tooling for AI and GPU infrastructure at scale. His work ensures customer workloads run reliably across OCI, balancing performance, usability, and the demands of modern cloud platforms. Jason is also the founder of Monitorama and created Monitoring Weekly, two influential projects in the observability community.

Tony Wu (Perplexity AI)

Perplexity AI is rethinking how people interact with information, and Tony Wu is helping scale the engineering behind it as VP of Engineering. Previously with OpenAI, Meta, and Uber, Tony has led teams working on training data management, labeling platforms, and large-scale experimentation systems. His experience in both infrastructure and applied AI engineering makes him central to driving Perplexity’s product innovation forward.

Ertan Dogrultan (Replit)

Replit powers collaborative software development at scale, and Ertan Dogrultan leads its engineering efforts as Director of Engineering. His remit spans platform infrastructure, cloud services, SRE, and AI systems, ensuring the developer experience scales with Replit’s rapid growth. Previously, Ertan held senior engineering leadership roles in fintech and has mentored up-and-coming engineering leaders through First Round Fast Track.

Jiliang Zhang (Rivian)

Rivian is reshaping the future of electric vehicles, where reliability isn’t optional but mission critical. Jiliang Zhang, Principal Reliability Engineer, brings over a decade of experience ensuring complex hardware systems meet the highest standards of dependability. Prior to Rivian, he worked at Apple and Tesla, focusing on reliability analysis, validation testing, and quality engineering for cutting-edge consumer and automotive products.

Sylvain Kalache (Rootly)

At Rootly, reliability isn’t just about responding to incidents; it’s about rethinking the future of reliability with AI. Sylvain Kalache leads AI Labs, where his team develops prototypes, open-source tools, and research to advance reliability standards. A former SRE at LinkedIn and SlideShare, Sylvain also co-founded Holberton School, training thousands of software engineers worldwide, and continues to bridge engineering rigor with forward-looking innovation.

Sarah Butt (Salesforce)

Sarah Butt is Principal Engineer for Centralized Incident Response at Salesforce, where she focuses on building resilient systems and response strategies at enterprise scale. She previously served as Director of Site Reliability Engineering at SentinelOne and spent several years at Salesforce in senior engineering roles. Sarah’s expertise spans incident response, AWS infrastructure, and engineering management, making her a trusted leader in the reliability space.

Coby Adams (SambaNova Systems)

SambaNova Systems develops advanced AI hardware and software systems, designed to push the limits of performance for large-scale AI workloads. At the center of this, Coby Adams directs customer support, managing global teams that keep enterprise clients running smoothly. With prior leadership experience at Disney and nearly two decades at Oracle, he brings deep expertise in incident management and operational excellence to supporting AI at scale.

Brandon Chalk (Independent Security Engineer)

Brandon Chalk is an independent security engineer focused on digital forensics and incident response. Until recently, he led critical incident response at Databricks, steering the Security Incident Response Team (SIRT) through high-impact events across one of the world’s fastest-growing data and AI platforms. Before that, he spent more than 16 years at Google, where he co-founded the Purple Team program and worked as an SRE on Gmail. Today, he continues to contribute to open source DFIR projects while consulting independently.

Anuj Madaan (Tesla)

Tesla runs some of the most advanced AI and infrastructure systems in the world, where reliability is critical to both manufacturing and autonomous driving. Anuj Madaan is Senior Incident Manager for Infrastructure, based in Berlin, leading response and change management to keep Tesla’s production systems resilient. He previously spent four years at IBM as a Systems Management Specialist, building his expertise in incident and service operations at enterprise scale.

Shane Arseneault (theScore)

Sports fans rely on theScore for real-time updates and insights. Behind the scenes, Shane Arseneault leads incident management, guiding the organization through major reliability challenges. Starting as an incident commander and now Head of Incident Management, he’s built processes and teams that keep critical systems online. Previously, he worked at Sun Life, bringing experience in large-scale systems monitoring and reliability operations.

Gandhi M N Kumar (Twilio)

Communications platforms like Twilio need to stay up under any circumstance. Gandhi M N Kumar serves as a Principal Incident Commander in Platform Engineering, leading the response to high-priority incidents across Twilio’s global products. Before Twilio, he held senior leadership roles at Barclays, managing product and collections strategy with a focus on resilience in financial systems.

Brittney Greene (Wealthsimple)

Brittney Greene manages incident response and technical support at Wealthsimple, a financial services platform used by millions across Canada. She’s built and scaled teams that improve resolution times and enhance client experience, bringing both engineering and social engagement practices into incident response. Over the past three years, she’s grown through progressively senior leadership roles, making reliability a cornerstone of Wealthsimple’s customer operations.

Casey Brown (Weights & Biases)

Weights & Biases builds the infrastructure powering modern AI development. Casey Brown leads infrastructure engineering, driving the systems that let practitioners experiment and scale safely. Previously, he spent nearly six years at Venmo, where he rose to Head of Platform Engineering and oversaw critical reliability initiatives. He also served as an advisor at Holberton School, supporting the next generation of engineers.

Mark Chounlakone (Windsurf)

Mark Chounlakone is a Site Reliability Engineer at Windsurf, an AI-first developer tools company focused on accelerating the coding workflow. He supports the reliability and performance of the platform that powers their next-generation IDE. Previously, he worked as a Robotics Engineer at Nimble Robotics and as a Manufacturing Test Engineer at Google, giving him a foundation that spans both hardware and large-scale systems.

Tricia Leavitt (Writer)

Tricia Leavitt is a Platform Engineer at Writer, the enterprise AI platform focused on safe, governed AI for businesses. She works at the intersection of infrastructure and developer experience—automating, hardening, and tuning the platform. Before Writer, she led platform/DevOps efforts at Pequity and Mission, introducing IaC and building CI/CD foundations.

Vivian Fernandez (Yahoo)

Yahoo runs internet-scale consumer services relied on by millions. Vivian Fernandez is Senior Director of Production Engineering, leading reliability and production operations across evolving platforms. With nearly a decade in senior leadership at Yahoo, she’s driven large-scale service engineering and infrastructure management through multiple waves of modernization.

Christopher Puziak (Zillow)

Christopher Puziak is a Senior Manager at Zillow, where he leads incident management to ensure the reliability of large-scale real estate platforms. Before Zillow, he was Senior Manager of Incident Management for Global Streaming Technology at Peacock, where he built and scaled the incident management function across a global audience. With over a decade in IT service and incident management, Christopher has become a trusted leader in keeping critical systems resilient and high-performing.

Duncan Winn (Zscaler)

Duncan Winn is VP and Lead of SRE at Zscaler, helping secure and operate cloud services used by enterprises worldwide. Prior to Zscaler he spent 5+ years at Google, where he directed SRE teams for storage and for security/privacy in sovereign cloud—background that informs Zscaler’s reliability and trust posture.
