
August 28, 2025

5 mins

Rootly joins Groq OpenBench with an SRE-focused benchmark

Making LLM evaluations reproducible for real-world SRE workflows

Written by Sylvain Kalache

Over the past few months, Rootly has been developing an SRE-oriented benchmark to test how well emerging LLMs handle code understanding and code writing in the real-world scenarios SREs face. Our research has been featured at ICML and ACL 2025, two of the leading conferences in the field of machine learning.

Today, we’re excited to announce that Rootly’s research-based SRE benchmark is now available in Groq OpenBench, which means you can use our benchmark to evaluate models with a single line of code, instead of the complex setup that was required before.

Groq released OpenBench to the open-source community to address a growing problem: figuring out which model is best for your use case. Because different providers use different prompts, formats, and scoring systems, it’s not easy to know which model is the right one for you. OpenBench helps you evaluate LLMs in a reliable, reproducible way, supports 18 providers, and aggregates 35+ benchmarks.

LLMs are becoming essential tools for SREs and platform teams, but general-purpose benchmarks don’t tell you how well a model will triage an incident, read logs, or suggest a mitigation. That’s why at Rootly we built our own benchmark focused on real-world SRE workflows, and now it runs on OpenBench, so you can access it too.

Groq OpenBench x Rootly AI Labs

Before Groq OpenBench, our team at Rootly AI Labs had to juggle multiple eval frameworks, each with its own quirks around prompting, parsing, and scoring. That made it hard to compare models fairly and slowed us down. It also made our benchmark harder for others to reproduce.

OpenBench changes that. It offers a standardized, repeatable, and provider-neutral way to test language models. Thanks to OpenBench’s native multithreading and automatic retry system, we’ve drastically reduced the time required to run our benchmarks without compromising rigor or reproducibility.

Why This Matters for SREs

Most leading benchmarks focus on general reasoning or code generation, which doesn’t reflect what SREs and platform engineers do day-to-day.

As our CEO JJ Tang puts it:

Most benchmarks test a model's coding ability, but that isn't the best indicator for SREs. They need models that can triage incidents, interpret logs, suggest mitigations, and more.

That’s why we built a benchmark focused on real-world SRE tasks. And now, through OpenBench, anyone can test models on SRE tasks with a few simple commands.

The Rootly SRE Benchmark is Open Source

All the work we do at Rootly AI Labs is open source. Our SRE benchmark can be run independently of any framework, but using OpenBench makes it much easier.

What we’ve contributed:

  • One of our SRE-focused benchmark tasks
  • 50% of our accompanying dataset (to reduce the risk of models overfitting on it)

Each of our four tests runs on a dataset of ~1,200 samples per model, which adds up quickly when evaluating across providers. Our methodology is fully open source, and the benchmark has been featured at ICML and ACL 2025.

You can find the benchmark and test documentation in the OpenBench repo.

How to Get Started

To test our Rootly AI Labs benchmark on OpenBench:

# Create a virtual environment and install OpenBench
uv venv
source .venv/bin/activate
uv pip install openbench

# Set your API key (any provider!)
export GROQ_API_KEY=your_key  # or OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.

# Run Rootly’s benchmark
bench eval gmcq --model "groq/llama-3.1-8b-instant" --T subtask=mastodon
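Because OpenBench is provider-neutral, you can run the same task against different providers simply by swapping the `--model` flag. A sketch of a cross-provider comparison (the model identifiers below are illustrative; substitute whatever models your providers offer):

```shell
# Evaluate the same Rootly GMCQ subtask on two different providers,
# changing only the API key and the --model flag between runs.

export OPENAI_API_KEY=your_key
bench eval gmcq --model "openai/gpt-4o-mini" --T subtask=mastodon

export ANTHROPIC_API_KEY=your_key
bench eval gmcq --model "anthropic/claude-sonnet-4-0" --T subtask=mastodon
```

Since the prompting, parsing, and scoring are identical across runs, the resulting scores are directly comparable.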

You can find more information about Rootly’s GMCQ methodology on GitHub.

Get Involved

This is just the beginning. Rootly AI Labs is continuing to develop more tests specific to reliability engineering, and we’re actively looking to collaborate.

If you’re working in AI, observability, or infrastructure reliability and want to help shape the future of AI for SREs, reach out to us or contribute directly on GitHub.

Big thanks to the Groq team for making OpenBench possible.

Rootly_logo

AI-Powered On-Call and Incident Response

Get more features at half the cost of legacy tools.

Book a demo