Gemini 3 breaks OpenAI’s long-standing lead in SRE tasks.

A shift just happened in SRE AI performance. Gemini 3 Pro didn’t just edge out OpenAI’s models, it beat them across every SRE task we threw at it. The landscape is changing faster than anyone expected.

Sylvain Kalache

November 24, 2025
4 minutes
Benchmarking LLMs for SRE-tasks, boosting Sonnet 4.5 performance by 100%

The new edition of our benchmark features Terraform tasks across AWS, GCP, and Azure, and adds a new dimension: prompt optimization.

Sylvain Kalache

October 8, 2025
10 mins
Introducing the On-Call Burnout Detector

An open source, research-based tool that looks for early-warning signs of burnout in your on-call engineers.

Sylvain Kalache

September 25, 2025
5 mins
SRECon EMEA 2025: Top Talks + Events

5 AI and reliability talks you can’t miss, plus the perfect after-conference events to wrap up Days 1 and 2 in Dublin

Sylvain Kalache

September 16, 2025
7 mins
Rootly joins Groq OpenBench with an SRE-focused benchmark

Making LLM evaluations reproducible for real-world SRE workflows

Sylvain Kalache

August 28, 2025
5 mins
Announcing Rootly AI Labs: Accelerating Reliability Engineering Through Community-Driven Innovation

Reliability engineering is evolving quickly—and AI is the catalyst. That’s why we’re excited to unveil Rootly AI Labs, a community-focused program dedicated to reshaping reliability through open collaboration, innovative prototypes, and cutting-edge research.

Sylvain Kalache

April 25, 2025
5 mins
Llama 4 underperforms: a benchmark against coding-centric models

Rootly AI Labs analyzes the performance of Meta’s Llama 4 models and finds they underperform compared to competitors like Claude 3.5 Sonnet and Qwen2.5

Sylvain Kalache

April 11, 2025
6 mins
Introducing the Rootly MCP Server

Connect Rootly to Cursor, Claude or Copilot with our open source MCP Server, available on GitHub.

Sylvain Kalache

March 20, 2025
5 mins
Introducing Rootly’s API AI-Agent-First Approach

Rootly’s AI-agent-first API, built on the Agents JSON standard, enables LLM-powered agents to automate workflows, streamline data handling, and enhance incident response.

Sylvain Kalache

February 25, 2025
3 mins
Classifying Error Logs with AI: Can DeepSeek R1 Outperform GPT-4o and Llama 3?

Can a smaller AI model outperform a larger one? A distilled version of DeepSeek R1 (70B) outperformed Llama and nearly matched GPT-4o in classifying error logs. These results suggest that model efficiency, not just size, is key to AI performance in incident management.

Sylvain Kalache

February 19, 2025
6 mins