
August 28, 2025
Making LLM evaluations reproducible for real-world SRE workflows
Over the past few months, Rootly has been developing an SRE-oriented benchmark to test how well emerging LLMs handle code understanding and code writing in the real-world scenarios SREs face. Our research has been featured at ICML and ACL 2025, two of the leading conferences in the field of machine learning.
Today, we’re excited to announce that Rootly’s research-based SRE benchmark is now available in Groq OpenBench, which means you can use our benchmark to evaluate models with a single line of code, instead of the complex setup that was required before.
Groq released OpenBench to the open-source community to address a growing problem: figuring out which model is best for your use case. Because different providers use different prompts, formats, and scoring systems, it’s hard to know which model is right for you. OpenBench helps you evaluate LLMs in a reliable, reproducible way, supports 18 providers, and aggregates 35+ benchmarks.
LLMs are becoming essential tools for SREs and platform teams, but general-purpose benchmarks don’t tell you how well a model will triage an incident, read logs, or suggest a mitigation. That’s why at Rootly we built our own benchmark focused on real-world SRE workflows, and now, it runs on OpenBench, so you can have access to it too.
Before Groq OpenBench, our team at Rootly AI Labs had to juggle multiple eval frameworks, each with its own quirks around prompting, parsing, and scoring. That made it hard to compare models fairly and slowed us down. It also made our benchmark harder for others to reproduce.
OpenBench changes that. It offers a standardized, repeatable, and provider-neutral way to test language models. Thanks to OpenBench’s native multithreading and automatic retry system, we’ve drastically reduced the time required to run our benchmarks without compromising rigor or reproducibility.
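To make the speedup concrete, here is a minimal Python sketch of the two ideas mentioned above: scoring many samples concurrently while retrying transient provider errors with backoff. The `flaky_api` stub and all names here are hypothetical stand-ins for a real provider call, not OpenBench’s actual implementation.

```python
import concurrent.futures
import time

# Track calls per prompt so the stub can fail once, deterministically.
_calls = {}

def flaky_api(prompt):
    # Stand-in for a real provider call: raises a transient error on the
    # first attempt for each prompt, then succeeds.
    _calls[prompt] = _calls.get(prompt, 0) + 1
    if _calls[prompt] == 1:
        raise TimeoutError("transient provider error")
    return f"answer:{prompt}"

def call_model(prompt, attempts=3, base_delay=0.01):
    # Retry with exponential backoff: the general shape of an automatic
    # retry system, not OpenBench's actual code.
    for attempt in range(attempts):
        try:
            return flaky_api(prompt)
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

prompts = [f"sample-{i}" for i in range(20)]
# Multithreading: evaluate many samples concurrently instead of one by one.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(call_model, prompts))
print(len(results))  # 20: every sample answered despite transient failures
```

Without retries, every one of these simulated transient failures would abort the run; with them, all 20 samples complete.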
Most leading benchmarks focus on general reasoning or code generation, which doesn’t reflect what SREs and platform engineers do day-to-day.
As our CEO JJ Tang puts it:
Most benchmarks test a model's coding ability, but that isn't the best indicator for SREs. They need models that can triage incidents, interpret logs, suggest mitigations, and more.
That’s why we built a benchmark focused on real-world SRE tasks. And now, through OpenBench, anyone can test models on SRE tasks with a few simple commands.
All the work we do at Rootly AI Labs is open source, and our SRE benchmark can be run independently of any framework. But using OpenBench makes it much easier.
What we’ve contributed: four tests, each running on a dataset of ~1,200 samples per model, which adds up quickly when evaluating across providers. Our methodology is fully open source, and the benchmark has been featured at ICML and ACL 2025.
You can find the benchmark and test documentation in the OpenBench repo.
To test our Rootly AI Labs benchmark on OpenBench:
# Create a virtual environment and install OpenBench
uv venv
source .venv/bin/activate
uv pip install openbench
# Set your API key (any provider!)
export GROQ_API_KEY=your_key # or OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.
# Run Rootly's benchmark
bench eval gmcq --model "groq/llama-3.1-8b-instant" --T subtask=mastodon
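Because OpenBench is provider-neutral, the same command sweeps across models by swapping the --model flag. Here is a minimal sketch that only prints the commands so you can review the sweep before running it; "openai/gpt-4o-mini" is a placeholder model name, so substitute any provider/model you have a key for:

```shell
# Print (rather than execute) the eval command for each model in the sweep.
# The second model name is a placeholder; use any provider you have keys for.
for model in "groq/llama-3.1-8b-instant" "openai/gpt-4o-mini"; do
  printf 'bench eval gmcq --model "%s" --T subtask=mastodon\n' "$model"
done
```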
You can find more information about Rootly’s GMCQ methodology on GitHub.
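Since GMCQ is a multiple-choice format (the "MCQ" in the name), its scoring ultimately reduces to accuracy: the fraction of samples where the model picks the gold answer. The sketch below is illustrative only, with made-up answers; it is not Rootly's or OpenBench's actual grading code.

```python
def mcq_accuracy(predictions, gold):
    # Fraction of multiple-choice answers matching the gold labels.
    # Illustrative scorer only, not the real GMCQ harness.
    if len(predictions) != len(gold):
        raise ValueError("prediction/gold length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Made-up answers for a tiny run; real GMCQ runs use ~1,200 samples per model.
gold = ["A", "C", "B", "D"]
predictions = ["A", "C", "D", "D"]
print(mcq_accuracy(predictions, gold))  # 0.75
```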
This is just the beginning. Rootly AI Labs is continuing to develop more tests specific to reliability engineering, and we’re actively looking to collaborate.
If you’re working in AI, observability, or infrastructure reliability and want to help shape the future of AI for SREs, reach out to us or contribute directly on GitHub.
Big thanks to the Groq team for making OpenBench possible.
Get more features at half the cost of legacy tools.