April 11, 2025

6 mins

Llama 4 underperforms: a benchmark against coding-centric models

Rootly AI Labs analyzes the performance of Meta’s Llama 4 models and finds they underperform compared to competitors like Claude 3.5 Sonnet and Qwen2.5

Written by Sylvain Kalache

The Rootly AI Labs is building the future of reliability, and part of our mission is to produce research that is openly shared to advance the standards of operational excellence. When Meta released Llama 4, accusations quickly surfaced claiming the company had artificially tuned its models to achieve impressive benchmark results. That sparked our curiosity to independently evaluate Llama 4's capabilities.

Our evaluation delivered surprising results: the highly anticipated Llama 4 significantly underperformed compared to its predecessor, Llama 3, and fell notably behind specialized coding models from Alibaba and OpenAI on a coding-centric benchmark that the Labs put together. We also could not reproduce Meta's finding that Llama 4 beats multimodal models such as GPT-4o, Gemini 2.0, and DeepSeek v3.1.

What's New with Llama 4?

Before we dive into the findings, let’s talk about Llama 4. Meta's Llama 4 series introduces three advanced models: Scout, Maverick, and Behemoth. Each leverages an innovative Mixture of Experts (MoE) architecture, which is supposed to enhance efficiency by activating only the parameters relevant to a specific task.

Unlike conventional ensemble methods—where multiple models collectively predict outcomes—MoE utilizes a gating network to selectively activate specialized "expert" sub-models tailored to handle distinct aspects of input data. Scout and Behemoth each have 16 experts, while Maverick has 128.
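To make the selective activation concrete, here is a minimal, illustrative sketch of an MoE layer with a top-k gating network in PyTorch. This is not Meta's implementation; the expert count, layer sizes, and top-k value are arbitrary assumptions chosen for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative Mixture-of-Experts layer: a gating network routes each
    token to a small subset of expert MLPs (top-k routing). All sizes are
    arbitrary assumptions, not Meta's actual configuration."""

    def __init__(self, d_model=512, d_hidden=2048, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.gate(x)                  # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token; the rest stay idle,
        # which is where the efficiency claim comes from.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out
```

Production implementations vectorize the routing and add load-balancing objectives; the loops here just make explicit that only the selected experts run for any given token.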

What We Tested Llama 4 Against

We chose to compare Llama 4 against two different sets of models. First, we wanted to test against leading multimodal models and try to replicate Meta's findings. Meta reported that Llama 4 beat GPT-4o and Gemini 2.0 Flash across a broad range of widely reported benchmarks, while achieving results comparable to the new DeepSeek v3 on reasoning and coding.

The second comparison was against models designed for coding tasks: Alibaba Qwen2.5-Coder, OpenAI o3-mini, and Claude 3.5 Sonnet. LLMs are typically optimized to perform best on specific attributes such as speed, deep reasoning, conversational ability, domain-specific knowledge, and multimodality (handling inputs beyond text, such as images or audio). End users should understand these trade-offs before applying a model to a practical scenario: a customer chat agent should use a model optimized for speed, whereas an accounting firm would want a domain-specific LLM. While none of the Llama 4 models were specifically designed for coding, Maverick should be the best positioned to perform on coding tasks.

Benchmarking Methodology

To measure performance, Rootly AI Labs fellow Laurence Liang developed a multiple-choice benchmark leveraging Mastodon’s public GitHub repository. Here is our methodology:

  • We sourced 100 issues labeled "bug" from the Mastodon GitHub repository.
  • For each issue, we collected the description and the associated pull request (PR) that solved it.
  • For each question, we fed the model the bug description and four candidate PRs to choose from, exactly one of which was the PR that solved the issue; no codebase context was included (a minimal evaluation sketch follows below).
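As a rough illustration of the setup described above, the sketch below shows how such a multiple-choice evaluation could be assembled and scored. The record fields (`bug`, `correct_pr`, `distractor_prs`) and the `query_model` callable are hypothetical placeholders, not the Labs' actual harness.

```python
import random

def build_prompt(bug_description, candidate_prs):
    """Format one multiple-choice question: a bug report plus four PR
    summaries, exactly one of which actually fixed the issue."""
    options = "\n".join(
        f"{letter}. {pr}" for letter, pr in zip("ABCD", candidate_prs)
    )
    return (
        "A bug was reported in the Mastodon repository:\n\n"
        f"{bug_description}\n\n"
        "Which of the following pull requests fixed this bug? "
        "Answer with a single letter.\n\n"
        f"{options}"
    )

def evaluate(model_name, dataset, query_model, seed=0):
    """Score a model on the multiple-choice set. `dataset` is a list of
    dicts holding the bug description, the correct PR, and three distractor
    PRs; `query_model(model_name, prompt)` is a stand-in for whatever API
    client is used to call the model."""
    rng = random.Random(seed)
    correct = 0
    for item in dataset:
        candidates = [item["correct_pr"]] + item["distractor_prs"]
        rng.shuffle(candidates)
        answer_letter = "ABCD"[candidates.index(item["correct_pr"])]
        reply = query_model(model_name, build_prompt(item["bug"], candidates))
        if reply.strip().upper().startswith(answer_letter):
            correct += 1
    return correct / len(dataset)  # accuracy, e.g. 0.70 for 70%
```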

Our Findings

We could not reproduce Meta’s findings of Llama 4 outperforming GPT-4o, Gemini 2.0 Flash, and DeepSeek v3.1. On our benchmark, Llama 4 actually came last in accuracy (69.5%), 6% behind the next-best-performing model (DeepSeek v3.1) and 18% behind the overall top performer (GPT-4o).

Despite high expectations, Llama 4 models struggled significantly in our SRE-focused benchmark:

  • Llama 4 Maverick achieved only a 70% accuracy score.
  • Alibaba’s Qwen2.5-Coder-32B unsurprisingly topped our rankings, closely followed by OpenAI's o3-mini, both of which achieved around 90% accuracy.
  • Llama 3.3 70B-Versatile, the previous generation, surprisingly outperformed the latest Llama 4 models by a small yet noticeable margin (72% accuracy).

Llama 4 Did Not Live Up to the Hype

Despite its sophisticated architecture and the promising benchmark numbers that advertised it, we did not find Llama 4 to perform well against the competition on our benchmark. Specialized models such as Alibaba’s Qwen2.5-Coder and OpenAI’s o3-mini seem to remain far superior choices for coding-related tasks.

You can find the benchmark dataset here if you want to reproduce our findings. While this is a small test set, we are looking to increase the number of tasks we test models against. Our benchmark is an open-source initiative built by the Rootly AI Labs. Reach out to us if you are interested in participating.
