July 2, 2024
6 mins
Discover how Google is optimizing for accuracy in its AI strategy, while Meta strives to expand its response capabilities through machine learning.
The world population in 2024 is approximately 8.12 billion people. Of these, 4.3 billion people use Google regularly, while 3.74 billion are active users on Meta's platforms. Any disturbance involving these tech giants will surely make headlines, as seen in Google’s recent UniSuper incident. The scale of these tech companies brings fascinating challenges in every aspect of their operations, including incident response.
When tens of thousands of diffs are being pushed to production every few hours, how do you tackle emerging incidents? Conceptually, the incident response process is not too different from response best practices: identification → coordination → resolution → closure.
However, the ways in which Google and Meta are implementing AI incident response are different. In this article, I’ll break down the wins and challenges that each organization is experiencing while incorporating machine learning and generative AI into their incident management.
Google’s approach to generative AI (Gen AI) in incident response focuses on leveraging the strengths of large language models (LLMs) while minimizing their potential drawbacks. They are primarily using AI to summarize incidents, which can save their response team significant time without compromising the accuracy of their incident response process.
The security team at Google is using ML algorithms to gather all conversations, logs, graphics, code, and any other event data related to an incident to generate complete and accurate summaries that comply with their response guidelines.
After several rounds of optimizations, Google achieved a 10% improvement in scores for AI-generated summaries compared to human-written ones, as assessed by site reliability engineers (SREs) unaware of the summaries' authors. Google expects their team will spend 51% less time writing incident summaries thanks to AI.
Google has also ventured into generating executive summaries for stakeholders, enabling incident commanders to speed up their communications. The results were on par with human-written summaries and helped responders prepare them in 52% less time.
While generative AI incident summaries are practically indistinguishable from human-written ones at Google, thanks to their guidelines and standards, there is always a risk of LLMs producing hallucinations and errors. Therefore, Google requires a human to approve each summary, making the person who signs off responsible for its accuracy.
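To make the sign-off requirement concrete, here is a minimal sketch of a human approval gate. It is not Google's implementation: `DraftSummary`, `approve`, and `publish` are hypothetical names invented for illustration, and the only rule it encodes is the one described above, that an AI-generated summary cannot go out until a named person has taken responsibility for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DraftSummary:
    """Hypothetical container for an AI-generated incident summary."""
    incident_id: str
    text: str
    approved_by: Optional[str] = None  # stays None until a human signs off

    @property
    def publishable(self) -> bool:
        return self.approved_by is not None


def approve(draft: DraftSummary, reviewer: str) -> DraftSummary:
    """Return a copy of the draft that records who vouches for its accuracy."""
    if not reviewer.strip():
        raise ValueError("a named human reviewer is required")
    return DraftSummary(draft.incident_id, draft.text, approved_by=reviewer)


def publish(draft: DraftSummary) -> str:
    """Refuse to publish any summary that lacks a human sign-off."""
    if not draft.publishable:
        raise PermissionError("AI-generated summary has not been approved by a human")
    return f"[{draft.incident_id}] {draft.text} (approved by {draft.approved_by})"
```

Keeping the reviewer's name on the record, rather than a simple boolean flag, mirrors the idea that the approver is accountable for the content.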
Meta follows a different approach to AI incident response. Instead of letting AI be a supplementary tool for their response team, Meta is aiming to expand their incident response capabilities with their AI system. They are focusing on an ambitious use case: identifying the root cause of an incident in their codebase.
Any resolution process starts with an investigation, which is one of the trickiest parts a response team has to tackle. As Steve McGhee, Google's Reliability Advocate, said in an interview, "you're trying to find something that somebody didn't intend to be there."
This ability to find root causes is one of the most valuable SRE skills, but it's hard. Meta's objective is to assist their team in this quest through AI tools that help them reduce the search scope from thousands of diffs to only five root cause candidates.
Using a combination of heuristics and layers of ML technologies, Meta can narrow down thousands of code changes to five root cause candidates for an incident. They trained their models on historical data from past incidents and achieved 42% accuracy in identifying possible root causes.
The accuracy is pretty impressive given the complexity of the task and can greatly narrow down the work a responder has to do to determine what caused the incident and how to fix it.
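The two-stage idea, cheap heuristics first and a ranking step second, can be sketched as follows. This is an illustration under stated assumptions, not Meta's system: the `Diff` fields, the path-prefix and time-window heuristics, and `recency_score` (a toy stand-in for a trained ranking model) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Diff:
    """Hypothetical record of a code change pushed to production."""
    diff_id: str
    touched_paths: Tuple[str, ...]
    minutes_before_incident: float  # negative means it landed after the incident


def heuristic_filter(diffs: Iterable[Diff], affected_prefix: str,
                     window_minutes: float) -> List[Diff]:
    """Stage 1: keep diffs that landed shortly before the incident and
    touched the affected area of the codebase (assumed heuristics)."""
    return [
        d for d in diffs
        if 0 <= d.minutes_before_incident <= window_minutes
        and any(p.startswith(affected_prefix) for p in d.touched_paths)
    ]


def recency_score(d: Diff) -> float:
    """Toy stand-in for an ML ranker: more recent changes score higher."""
    return 1.0 / (1.0 + d.minutes_before_incident)


def top_candidates(diffs: Iterable[Diff], score: Callable[[Diff], float],
                   k: int = 5) -> List[Diff]:
    """Stage 2: rank the survivors and keep at most k candidates."""
    return sorted(diffs, key=score, reverse=True)[:k]
```

In practice the scoring function would be replaced by a model trained on past incidents; the surrounding filter-then-rank structure stays the same.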
Conscious that throwing responders into a wild-goose chase is a risk with this strategy, the Meta team ensures their AI system can explain why each root cause candidate was chosen. From there, responders can evaluate if it’s a reasonable thesis before doing a full deep dive.
Additionally, they discard low-confidence candidates rather than recommending them to responders.
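A minimal sketch of those two safeguards, an explanation attached to every candidate and a confidence cutoff, might look like this. The `Candidate` type, the `recommend` function, and the 0.5 threshold are assumptions made for illustration; the source does not state how Meta represents confidence or what cutoff it uses.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical root cause candidate produced by the ranking system."""
    diff_id: str
    confidence: float  # assumed to be in the range 0..1
    explanation: str   # why the system thinks this diff is responsible


def recommend(candidates: Iterable[Candidate],
              min_confidence: float = 0.5,  # assumed threshold
              k: int = 5) -> List[Candidate]:
    """Return up to k candidates, highest confidence first, dropping any
    that fall below the threshold or come without an explanation."""
    kept = [
        c for c in candidates
        if c.confidence >= min_confidence and c.explanation.strip()
    ]
    return sorted(kept, key=lambda c: c.confidence, reverse=True)[:k]
```

Note that an empty result is a valid outcome here: recommending nothing is preferable to sending responders after a weak lead.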
For Google, AI has a supplementary role in their incident response plan. It can help responders save time on tasks that a machine can be trained to do very well so they can focus on the response actions that only humans can do well.
For Meta, on the other hand, AI is actively working to find answers so that responders have a narrower problem space to digest.
Both companies are exploring ways to streamline their automated incident response through avenues that are compatible with their culture and ecosystem.