
February 19, 2025

6 mins

Classifying Error Logs with AI: Can DeepSeek R1 Outperform GPT-4o and Llama 3?

Can a smaller AI model outperform a larger one? A distilled version of DeepSeek R1 (70B) outperformed Llama and nearly matched GPT-4o in classifying error logs. These results suggest that model efficiency, not just size, is key to AI performance in incident management.

Written by Sylvain Kalache

DeepSeek R1 has been found to have performance comparable to OpenAI’s o1 model and can even exceed it in some cases. As part of Rootly’s mission to make incident management as easy as possible using the latest AI tools, we wanted to evaluate how DeepSeek R1 would perform when analyzing system error logs (Apache web server) and how it compared to other language models.

To achieve this, we partnered with the Waterloo-based HackOS team for a DeepSeek hackathon to distill DeepSeek R1 from 671 billion parameters to 70 billion and benchmark its performance. Model distillation is a technique for creating smaller, more efficient versions of large AI models while maintaining most of their performance. This process is crucial for making powerful AI models more accessible and practical for wider use.

The hypothesis was that a distilled version of DeepSeek R1 could still outperform other larger models. During this hackathon, DeepSeek R1 was distilled to 70B and tested against Llama 3.3 70B and GPT-4o, whose parameter count remains undisclosed but is likely significantly higher. For clarity, I’ll refer to them as DeepSeek and Llama throughout this article.

The results showed that the distilled DeepSeek model performed 4.5 times better than Llama and nearly twice as well as GPT-4o in classifying error types in server logs. However, GPT-4o still had a slight edge in classifying severity levels.

Let’s dive into the methodology.

Benchmarking

Rootly provided a testing dataset, which the HackOS team turned into a hackathon dataset and benchmarking tooling built around four metrics for comparing DeepSeek with the other models:

  • Error type loss: classification task on the error type
  • Severity loss: classification task on the severity
  • Description loss: a statement describing the problem
  • Solution loss: a statement summarizing a solution

The description and solution losses were calculated by comparing how similar each model’s output was to DeepSeek’s output, using a method called cosine similarity on text embeddings created by mpnet-base-2. This allowed us to measure how closely another model's output aligned with DeepSeek’s. In this context, losses refer to the differences or errors measured between a model’s output and the "correct" output—where lower losses are desirable.
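To make this concrete, here is a minimal sketch of how such a semantic loss could be computed with the sentence-transformers library. It assumes the embedding model is all-mpnet-base-v2 (the mpnet variant mentioned above) and defines the loss as 1 minus cosine similarity; the hackathon’s actual tooling may differ in its details.

```python
# Minimal sketch of a semantic "loss" between two model outputs.
# Assumptions: the embedding model is all-mpnet-base-v2 and loss = 1 - cosine
# similarity; the hackathon's actual tooling may compute this differently.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

def semantic_loss(reference: str, candidate: str) -> float:
    """Return 1 - cosine similarity between the embeddings of the two texts."""
    ref_emb, cand_emb = model.encode([reference, candidate], convert_to_tensor=True)
    similarity = util.cos_sim(ref_emb, cand_emb).item()
    return 1.0 - similarity

# Hypothetical example: a reference description vs. another model's description.
reference_desc = "Client requested a file that does not exist on the server."
candidate_desc = "The requested resource was not found, returning HTTP 404."
print(f"Description loss: {semantic_loss(reference_desc, candidate_desc):.4f}")
```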

Model performances were compared across tasks using error type and severity loss. The description and solution losses evaluated consistency in semantic content between models. While this benchmark could be vastly improved, we were under a time constraint and plan to enhance it for future benchmarking.
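The exact formula behind the error type and severity losses isn’t spelled out here; one plausible reading is a simple misclassification rate, i.e. the fraction of log samples for which a model’s predicted label disagrees with the ground truth. A sketch under that assumption:

```python
# Sketch of the error-type and severity losses, assuming they are computed as
# the fraction of misclassified samples (the exact formula used at the
# hackathon is not specified in this post).

def classification_loss(predicted: list[str], ground_truth: list[str]) -> float:
    """Fraction of samples where the predicted label differs from the ground truth."""
    assert len(predicted) == len(ground_truth)
    mismatches = sum(p != g for p, g in zip(predicted, ground_truth))
    return mismatches / len(ground_truth)

# Hypothetical labels extracted from each model's answers for a handful of log lines.
ground_truth = ["404 Not Found", "500 Internal Error", "403 Forbidden", "404 Not Found"]
predicted    = ["404 Not Found", "500 Internal Error", "404 Not Found", "404 Not Found"]
print(f"Error type loss: {classification_loss(predicted, ground_truth):.4f}")  # 0.2500
```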

Findings

Groq’s DeepSeek model (DeepSeek R1 distilled onto the Llama architecture) was compared against Llama, also served through Groq’s API, and against GPT-4o. The same prompt was used for all three models.
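For illustration, a setup along these lines would send one identical prompt to the two Groq-hosted models and to GPT-4o. The model IDs and the prompt wording below are my assumptions, not the exact ones used at the hackathon.

```python
# Sketch of querying the three models with an identical prompt.
# Assumptions: Groq model IDs "deepseek-r1-distill-llama-70b" and
# "llama-3.3-70b-versatile", OpenAI model ID "gpt-4o", and this prompt wording;
# the hackathon's actual prompt and model IDs may differ.
import os
from groq import Groq
from openai import OpenAI

PROMPT = (
    "Classify the following Apache error log line. "
    "Return the error type, a severity level, a one-sentence description, "
    "and a one-sentence suggested solution.\n\nLog line: {log_line}"
)

groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def classify(log_line: str) -> dict[str, str]:
    """Send the same prompt to all three models and collect their raw answers."""
    prompt = PROMPT.format(log_line=log_line)
    answers = {}
    for model_id in ("deepseek-r1-distill-llama-70b", "llama-3.3-70b-versatile"):
        resp = groq_client.chat.completions.create(
            model=model_id, messages=[{"role": "user", "content": prompt}]
        )
        answers[model_id] = resp.choices[0].message.content
    resp = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    answers["gpt-4o"] = resp.choices[0].message.content
    return answers
```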

The distilled DeepSeek model outperformed GPT-4o (0.33) and Llama (0.85) in classifying error types, achieving an error type loss of 0.18. Given that the Llama model’s architecture was used to distill DeepSeek, this one-to-one comparison demonstrates that DeepSeek can achieve superior performance relative to models of a similar size.

However, in classifying the severity of error logs, GPT-4o performed slightly better (0.0437) compared to DeepSeek R1 (0.0563), while both models significantly outperformed Llama (0.9688).

Model                              Error Type Loss   Severity Loss
DeepSeek R1 Distilled 70B (Groq)   0.1875            0.0563
Llama 3.3 70B (Groq)               0.8500            0.9688
GPT-4o                             0.3312            0.0437

Because the benchmarking treats DeepSeek’s output text as the “ground truth,” it isn’t practical to use this method to rank DeepSeek against other models, as there is an inherent bias toward it. Instead, the benchmark measured how similar the outputs of GPT-4o and Llama were to DeepSeek’s.

The losses for both description and solution were similar in size and were calculated using cosine similarity, as explained earlier. This method compares the meaning of two phrases based on their text embeddings. For example, “ice is cold” and “snow is chilly” are more similar than “ice is cold” and “this toast is warm.”
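As a quick sanity check of that intuition, a few lines with the same embedding model (assumed, as before, to be all-mpnet-base-v2) show that the semantically closer pair scores the higher similarity:

```python
# Illustration of the ranking claim above using cosine similarity on embeddings
# (all-mpnet-base-v2 assumed, as in the earlier sketch).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")
emb = model.encode(["ice is cold", "snow is chilly", "this toast is warm"],
                   convert_to_tensor=True)
close = util.cos_sim(emb[0], emb[1]).item()   # similar meaning
far = util.cos_sim(emb[0], emb[2]).item()     # unrelated meaning
assert close > far  # the semantically closer pair has the higher cosine similarity
```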

The results show that the meaning of DeepSeek’s text outputs was similar to that of GPT-4o’s and Llama’s, which suggests that a user accustomed to GPT-4o’s outputs would receive similar content from DeepSeek. This does not, by itself, assess the quality of the answers.

Model                              Description Loss   Solution Loss
DeepSeek R1 Distilled 70B (Groq)   0.1300             0.1956
Llama 3.3 70B (Groq)               0.1453             0.2187
GPT-4o                             0.1356             0.2353

It is important to note that for the semantic metrics, the benchmarking dataset considered only one possible “ground truth” value per sample. This means that log lines with multiple valid interpretations were not fully represented.

Distilled DeepSeek for the Win

While the HackOS hackathon findings are preliminary, they highlight that a distilled version of a large model can perform better than non-distilled models of the same size when analyzing error logs. Given these strong performances, it would be interesting to further distill DeepSeek R1 and see if the model maintains similar performance levels when compared with larger models such as OpenAI’s o1 and o3. This also suggests that smaller, distilled models could be efficiently embedded within various parts of our monitoring and logging stack, significantly improving speed and augmenting our ability to process error logs in real time.

We believe that LLMs will become an essential part of the SRE toolbelt. If you are interested in pioneering this topic further or in getting your hackathon sponsored by Rootly, please drop us a line.

Credits

The HackOS 3 DeepSeek Hackathon was organized by Akira Yoshiyama, Aniket Srinivasan, and Laurence Liang. It was hosted simultaneously at the University of Waterloo and McGill University.

William Zeng, Jerry Zhu, and Isabelle Gan developed Shallow Search, the DeepSeek eval repository used to benchmark DeepSeek R1 70B—based on Aniket’s benchmark repository. Laurence Liang wrote the additional code to compare DeepSeek R1 70B with GPT-4o and Llama 3.3-70B. Aniket and Laurence Liang provided a written summary of the findings.
