AI Didn't Change the Game — It Just Exposed Your Bottlenecks w/ Ganesh Datta (Co-Founder & CTO, Cortex)

Ganesh Datta
🍺 Founded Cortex over a beer
📋 Sick of the spreadsheet
🎙️ Hosts Braintrust podcast
❄️ Hates "special snowflakes"

Listen on Spotify and Apple Podcasts!


Every engineering org says they want to improve reliability, but most can't even agree on what "good" looks like. Ganesh Datta, Co-Founder and CTO of Cortex, has spent the better part of a decade helping companies confront that gap. In this conversation, Ganesh makes the case that platform engineering and SRE are solving the same human problem — earning adoption through influence, not authority — and that the two teams belong on the same user journey. He breaks down why production readiness programs decay the moment you stop automating them, why clear service ownership is the foundation everything else depends on, and why AI hasn't actually changed the playbook: it's just amplifying the bottlenecks you were already ignoring.

How did Cortex get started?

I'm one of the co-founders and CTO of Cortex. I started Cortex very much based on my own personal pain as a software engineer. Before starting Cortex, I was at a FinTech startup for a while. I saw the monolith-to-microservices journey from being part of the team that pulled the very first service out of the monolith to a couple hundred services by the time I left.

Over that time, I had the chance to build our first cookie-cutter-based templating for spinning up new services and rolled out our first production readiness program to try to standardize the way we were doing things across teams. But a lot of that pain came down to a lack of visibility into who owned things. You'd get paged at 2:00 AM for an incident, you're trying to figure out who to loop in, and you don't really know what the service does or who owns it.

Trying to get people to care about those production standards was painful — you're running after people to fill out this giant spreadsheet. So eventually I said, "I'm sick of this spreadsheet. There must be a better solution." This was back in 2019. Turns out there was not yet a better solution. I'd search "service catalog" or "microservice catalog" and get things like Consul or etcd — system-to-system service catalogs, not human-facing ones.

My two co-founders, Anish and Nikhil, were working at Uber and Twilio at the time. Over a beer in SF, I asked them how they solve the same problem. They said, "Oh, we just Slack people. Just tag somebody and try to figure it out." It was a light bulb moment. A company of thousands of engineers and a company of a hundred engineers had the exact same problem. There's something here. And that's how Cortex started.

What actually drives platform adoption when you can't mandate it?

The way I like to think about it is: if you think about a platform as an internal-facing product you're delivering to developers, it really is a product problem. Let's say you're building the next Instagram. You can't hold a gun to people's heads and say, "Download my app right now." A good product works because you solve a problem people have, you iterate really quickly, and you deliver great value. People will want to use it because it does something another app doesn't.

Internal platforms work the same way. Yes, maybe you have an authority figure who can mandate the platform, but it's still not going to get the adoption you want. So the way to drive adoption is to think about it like a product. What problem are we trying to solve? Why am I building this platform in the first place? Our CTO or VP allocated five headcount to this — there's some business reason we're putting scarce resources against it. Let's go find that business problem.

Then ask: what is currently preventing our developers from solving that problem? Something is fundamentally wrong in our platform. Let's deliver an iterative version and make it really powerful, really easy, and just a great experience. People will come. It naturally creates a feedback loop where people use what you're building because it's so much better than the alternative — not because you forced it on them. Start small, focus on a key value principle, deliver incremental value, and make people love what you're building.

Why don't SRE and platform engineering teams collaborate more?

It's interesting because SRE is very similar to platform engineering in this sense: you have a very important charter — help us figure out stuff when things go wrong, or prevent things from going down — but you have no authority. It's an influence-based function. You can define best practices and ask people to adopt them, maybe put stricter guardrails around it, but it's very much influence-based. And similar to platform, the ratio of that team to the broader development group is very low. One SRE to 20, 30, 100 developers, if not more.

So you're dealing with two problems: you're trying to work through influence, and you have very little reach relative to the size of the organization. I think the reason these functions don't work together as much as you'd want is that at a high level, they seem to have two very different charters — one focused on reliability, one on infrastructure and developer experience. Those seem like very different things.

But if you peel the layers back, why is the influence part of the SRE role so hard? Because you're trying to get developer teams to adopt certain practices. That sounds very much like the platform problem. Now you have two groups trying to solve a very similar human problem. So if we think about how they should overlap: if we're both trying to influence other teams to adopt practices and tooling, why don't we combine forces from a development standpoint to get the adoption we both care about?

What does it look like when SRE and platform engineering actually collaborate well?

Think about it as a user journey. The platform team is building the set of capabilities for a developer to bootstrap new code, new infrastructure, and take it to production. The SRE team is putting guardrails in place that say, "If you're going to production or you're in production, these are all the things you should be doing for a high-reliability service."

These are both part of the exact same user journey, just different parts. Maybe the platform engineering team owns the journey from zero to the 65% mark, then the SRE team comes in with its production readiness requirements, and then it's back to the platform to take the thing to production. That screams to me that the SRE team should be part of the platform capabilities.

Here's a very specific example. SRE teams care about production readiness: you have an on-call rotation, dashboards for quick diagnosis, you're registered with your incident management tool, you have uptime alerts, latency alerts, you're managing your vulnerability posture. These all sound like things you can automate. What if the SRE team went to the platform team and said, "These are the things we expect any new service going to production to be doing — what if we helped you bake those into the platform?"

Now if I'm a developer, I spin up a new service using the platform tooling and I get metrics and alerts out of the box. It auto-registers me in the incident management tool. I get all of these things for free. Both teams win. The platform team delivers value because developers know the SRE team is going to come after them anyway — they might as well use the platform that makes it easy. The SRE team is happy because new services are created with everything baked in by default.
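
As a rough sketch of what "baked in by default" can look like, here is a hypothetical bootstrap flow in Python. Every name in it (ServiceSpec, bootstrap_service, the default alert list) is illustrative rather than Cortex's actual tooling, and a real platform would call out to source control, the metrics stack, and the incident management tool instead of printing the steps.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceSpec:
    """Everything the platform needs to know to stand up a new service."""
    name: str
    owning_team: str
    template: str = "python-service"  # cookie-cutter template to scaffold from
    default_alerts: list = field(
        default_factory=lambda: ["uptime", "p99_latency", "error_rate"]
    )


def bootstrap_service(spec: ServiceSpec) -> None:
    """Spin up a new service with SRE readiness defaults included from day one."""
    steps = [
        f"scaffold repo '{spec.name}' from template '{spec.template}' (tests and CI included)",
        f"register '{spec.name}' in the service catalog with owner '{spec.owning_team}'",
        f"attach the '{spec.owning_team}' on-call rotation in the incident management tool",
        f"create default dashboards for '{spec.name}'",
        *[f"create '{alert}' alert for '{spec.name}'" for alert in spec.default_alerts],
        f"enable vulnerability scanning for '{spec.name}'",
    ]
    for step in steps:
        print(f"[platform] {step}")  # stand-in for real API calls


if __name__ == "__main__":
    bootstrap_service(ServiceSpec(name="payments-api", owning_team="payments"))
```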

Should platform engineers be embedded in development teams?

I think it's a really interesting model. We haven't seen that specifically, but what we are seeing in a lot of organizations is these groups getting rolled up into a single reporting structure. Rather than having SRE in one structure and platform engineering in a separate one, you have platform, SRE, developer experience, sometimes security, all rolling up into a single VP. That's really important.

I like to think of it as: they're helping drive excellence across the organization across different dimensions — speed, reliability, security. By having all those groups roll up into a single leader, you get accountability. That leader is accountable not just for developer experience, but also for reliability and security. They're incentivized to have those teams work together. Even though you have different teams with different charters underneath, you have a single leader accountable for global outcomes, and that forces those teams to work together in concert.

The other thing you should absolutely do if you're a platform engineering team: hire a product manager, or designate somebody to operate like one — someone who brings SRE in and gives them a seat at the table. Don't think of developers as your only stakeholder. Think of SRE as a champion and end user of your platform and bring them along for the journey, because that's what creates the right outcomes.

Why do organizations need a shared definition of "good"?

If your teams don't agree on what good looks like, you're going to get different outcomes because people are all working toward totally different things. This is the importance of KPIs or OKRs — whatever you want to call them. It's the goal. We're all moving in the same direction, and if we're not doing something related to that, we should question it.

Within engineering organizations that have internal-facing goals, we often don't do this. We say, "We want to improve reliability. We want to improve MTTR." Fine, we have a metric. But what does good look like? Is one hour good? Is 30 minutes good? Is 24 hours good?

And by the way, trying to move the entire organization toward a single number for internal metrics is not particularly meaningful. What you want to give people are very tactical things. I think about metrics as lagging indicators. You're not going to tell a team, "Improve your MTTR." You're going to say, "These are the types of issues you're dealing with. I want you to adopt a new incident management platform because having a good process will help us deal with the chaos, which will then improve MTTR."

If you have a shared definition of good, you can hold people accountable. If you don't know what good looks like, how can you hold teams accountable? "Well, you didn't tell me what good looks like, so I did this other thing." But if we all understand what good looks like, we can march toward it. And if we're not marching toward it, we can step back and ask why. On the flip side, if you're not getting time or resources, you have something to point at: "Hey, remember when we all agreed this is what good looks like? We're doing stuff that's not going to help us get better. Are we okay with that?"

How should a team go about defining production readiness?

Start with a question: what are we trying to solve? Why do we care about production readiness? The reason is we know we're not going to catch all the issues before something goes to production. Things will go wrong. And it's all about: when something goes wrong, do we have the right things in place to mitigate it?

So production readiness is a superset of that problem — what are all the practices that could help us figure things out when something goes wrong? Having the right monitors and alerts, having an incident management tool set up for every service, making sure you know the accountable owner so you can escalate. Knowing the service has met basic code coverage and quality requirements — if you have zero tests, it's probably not ready. Security can be part of it too: if you haven't set up a vulnerability scanner or you have five critical vulnerabilities, maybe you're not ready for production.

Production readiness is about enforcing practices before something goes to production so that if something goes wrong, you have the right tools, systems, and people in place.
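
One way to make a definition like that concrete is to encode it as checks that run against each service's catalog record. The Python sketch below is a minimal illustration; the Service fields, check names, and thresholds (60% coverage, zero critical vulnerabilities) are assumptions made for the example, not a prescribed standard.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Service:
    """A simplified catalog record for one service."""
    name: str
    owner: Optional[str]
    has_oncall_rotation: bool
    registered_in_incident_tool: bool
    has_uptime_alert: bool
    has_latency_alert: bool
    code_coverage: float  # 0.0 to 1.0
    critical_vulnerabilities: int


# The shared definition of "ready": one predicate per requirement.
READINESS_CHECKS: Dict[str, Callable[[Service], bool]] = {
    "has an accountable owner": lambda s: s.owner is not None,
    "has an on-call rotation": lambda s: s.has_oncall_rotation,
    "registered with incident management": lambda s: s.registered_in_incident_tool,
    "uptime alert configured": lambda s: s.has_uptime_alert,
    "latency alert configured": lambda s: s.has_latency_alert,
    "code coverage at least 60%": lambda s: s.code_coverage >= 0.60,
    "no critical vulnerabilities": lambda s: s.critical_vulnerabilities == 0,
}


def readiness_report(service: Service) -> Dict[str, bool]:
    """Evaluate one service against the shared definition of 'ready'."""
    return {name: check(service) for name, check in READINESS_CHECKS.items()}
```

The specific checks matter less than the fact that they live in code: the same definition of "ready" applies to every service, and it can be re-run at any time rather than filled into a spreadsheet once.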

How flexible should organizations be with these standards?

Eighty-five percent of your requirements should be shared. If you don't have shared requirements, it's no longer a shared definition of good. And I'll be honest — in our business, we see this a lot. Companies come to us and say, "We're special. We do things differently." I'm sorry, you don't do things differently. Everyone thinks they're a special snowflake. You have the same problems as everyone else. People describe eight different problems to me — I've heard those exact problems 400 times this year.

Whether you have 20-year-old services running on ancient JVMs or the hot new thing on serverless, your operational practices to go into production are probably similar. You still want accountable owners. You want an incident management program, code quality metrics, vulnerability scanning. Those things are true across the board.

Here's another reason these things need to be the same. The whole point of software engineering is breaking down a problem to its most repeatable steps and codifying it. That's what code does. If you can define a set of best practices, you can probably also codify edge cases in a meaningful way. And if you have a shared set of requirements across your organization, that is what allows you to automate it.

If teams think they're special snowflakes, you're never going to automate it. If you never automate these production readiness standards, it's never going to work — you'll have a spreadsheet that's seven months out of date. The other reality is a lot of organizations do production readiness once before something goes to prod and never do it again. But six months later it may no longer be considered ready because something changed. If you can't automate it, you'll never catch those things, and your production readiness program falls apart. It's basically as good as nothing.
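
Continuing the earlier sketch (reusing the hypothetical Service and readiness_report), automating the standard can be as small as a scheduled sweep over the catalog, so a service that drifts out of compliance gets flagged instead of sitting in a stale spreadsheet:

```python
def nightly_readiness_sweep(catalog: list) -> None:
    """Re-evaluate every service and flag any that have drifted out of readiness."""
    for service in catalog:
        report = readiness_report(service)  # readiness_report from the sketch above
        failures = [name for name, passed in report.items() if not passed]
        if failures:
            # In practice this would open a ticket or notify the owning team's channel.
            print(f"{service.name} (owner: {service.owner}) no longer meets the bar: {failures}")
```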

Where should teams start if they want to build a production readiness program?

It's a bit of a chicken-and-egg problem. For most organizations, you don't really know what to improve because you're not measuring anything. You can't improve what you can't measure. So start with visibility into your issues — what are the most common failure modes? That's a really important place to start.

The second thing is accountability. Let's say you roll out a production readiness program and you have the best framework, all the right requirements, a shared definition of good, all the data. And then you say, "Okay, now we're going to ask teams to improve production readiness." The next question is, "Who is going to do that?" Programs like this fail if you don't have clear accountability. If you can't point to a service and say, "This team owns it, and if it's not meeting our definition of good, I know who to go to" — then the whole thing falls apart.

So: data, clear accountability and ownership, and then in parallel you can define the framework.

And this accountability thing is not just for reliability. Having clear on-call rotations and escalation paths is critical for incidents. You don't want to be asking, "Is it your problem? Is it your problem?" during an incident. It's like the bystander effect — you don't say, "Somebody call 911." You say, "Ganesh, you call 911." All right, I'm accountable. Let me do it. Start with ownership, clear accountability, and then you have the right foundation to run these programs on an ongoing basis.

How has GenAI impacted this space?

My hot take is that nothing has really changed. And I know that's spicy, but what I mean is: yes, maybe the mechanism by which we're writing code has changed. But even the recent DORA report talked about this — AI is just an amplifier. It's showing you the existing problems in your environment and processes.

If you have a bottleneck in code reviews, you write more PRs because AI is writing more code, and you just have a worse bottleneck. If you have really slow builds, now you're adding 20% more code to your repo, and your builds are going to get slower. AI is not going to magically solve that.

Look at incident management — something goes wrong, and maybe you have AI SRE tools to help you manage incidents. But those tools can only work if you have monitors, alerts, and logging in place so the AI can operate on that data. All of a sudden, it sounds like the things we've been talking about for years: make sure your foundations are in place, do good testing so agents can write code without breaking things, think about bottlenecks and improve them one by one, follow basic production readiness and reliability standards.

Whether it's a human or an agent managing an incident, those are the things it needs. So invest in those foundations. Nothing has changed. Keep investing in the basics, and it will actually make your ability to adopt AI much, much better.