How Meta and Google use AI to improve incident response

Discover how Google is optimizing for accuracy in its AI strategy, while Meta strives to expand its response capabilities through machine learning.

Written by

JJ Tang

The world population in 2024 is approximately 8.12 billion people. Of these, 4.3 billion use Google regularly, while 3.74 billion are active users on Meta's platforms. Any disturbance involving these tech giants will surely make headlines, as seen in Google’s recent UniSuper incident. The scale of these companies brings fascinating challenges in every aspect of their operations, including incident response.

When tens of thousands of diffs are being pushed to production every few hours, how do you tackle emerging incidents? Conceptually, their incident response process is not too different from standard best practices: identification → coordination → resolution → closure.

Diagram: Google’s incident response process, from GoogleBlog

However, Google and Meta are implementing AI incident response in different ways. In this article, I’ll break down the wins and challenges that each organization is experiencing while incorporating machine learning and generative AI into their incident management.

Google: More accuracy, fewer risks

Google’s approach to generative AI (Gen AI) in incident response focuses on leveraging the strengths of large language models (LLMs) while minimizing their potential drawbacks. They are primarily using AI to summarize incidents, which can save their response team significant time without compromising the accuracy of their incident response process.

The security team at Google is using ML algorithms to gather all conversations, logs, graphics, code, and any other event data related to an incident, and to generate complete and accurate summaries that comply with their response guidelines.

Gains: Faster, Better Incident Summaries

After several rounds of optimizations, Google achieved a 10% improvement in scores for AI-generated summaries compared to human-written ones, as assessed by site reliability engineers (SREs) unaware of the summaries' authors. Google expects their team will spend 51% less time writing incident summaries thanks to AI.

Graph: Google’s expected time savings thanks to Gen AI drafts, from GoogleBlog

Google has also ventured into generating executive summaries for stakeholders, enabling incident commanders to speed up their communications. The results were on par with human-written summaries and helped responders prepare them in 52% less time.

Risks: Accountability

While generative AI incident summaries are practically indistinguishable from human-written ones at Google, thanks to their guidelines and standards, there is always a risk of LLMs producing hallucinations and errors. Therefore, Google requires a human to review and approve each summary, which makes the person signing off responsible for its accuracy.
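
To make the sign-off idea concrete, here is a minimal sketch of an approval gate. It is not Google's tooling: `SummaryDraft`, `approve`, and `publish` are hypothetical names, and the only rule it encodes is the one described above, that a draft cannot go out until a named person has approved it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SummaryDraft:
    """An AI-generated incident summary awaiting human review (hypothetical type)."""
    incident_id: str
    text: str
    approved_by: Optional[str] = None  # stays None until a human signs off

    def approve(self, reviewer: str) -> None:
        # Record who takes responsibility for the summary's accuracy.
        if not reviewer.strip():
            raise ValueError("a named reviewer is required")
        self.approved_by = reviewer


def publish(draft: SummaryDraft) -> str:
    """Refuse to publish any draft that has not been approved by a human."""
    if draft.approved_by is None:
        raise PermissionError("summary needs human approval before publishing")
    return f"[{draft.incident_id}] {draft.text} (approved by {draft.approved_by})"
```

Storing the reviewer's name on the draft keeps accountability attached to the summary itself rather than to a separate log.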

Meta: Harder problems, lower accuracy

Meta follows a different approach to AI incident response. Instead of letting AI be a supplementary tool for their response team, Meta is aiming to expand their incident response capabilities with their AI system. They are focusing on an ambitious use case: identifying the root cause of an incident in their codebase.

Any resolution process starts with an investigation, which is one of the trickiest parts a response team has to tackle. As Google's Reliability Advocate Steve McGhee said in an interview, "you're trying to find something that somebody didn't intend to be there."

This ability to find root causes is one of the most valuable SRE skills, but it's hard. Meta's objective is to assist their team in this quest through AI tools that help them reduce the search scope from thousands of diffs to only five root cause candidates.

Gains: From Thousands to Five Culprits

Using a combination of heuristics and layers of ML technologies, Meta can narrow down thousands of code changes to five root cause candidates for an incident. They trained their models on historical data from past incidents and achieved 42% accuracy in identifying possible root causes.

Diagram: Meta’s filtering stages after heuristics, used to identify root cause candidates, from Engineering at Meta

The accuracy is pretty impressive given the complexity of the task and can greatly narrow down the work a responder has to do to determine what caused the incident and how to fix it.
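
The funnel described above (cheap heuristics first, then a ranking step that keeps five candidates) can be sketched in a few lines. This is an illustration under stated assumptions, not Meta's system: the directory-prefix heuristic, the `Diff` type, and the `in_scope_fraction` score are all hypothetical stand-ins, and a real ranker would be a trained model rather than a hand-written function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diff:
    diff_id: str                    # hypothetical identifier
    touched_paths: tuple[str, ...]  # files changed by the diff


def heuristic_filter(diffs, affected_prefixes):
    """Stage 1 (assumed heuristic): keep diffs that touch an affected directory."""
    return [
        d for d in diffs
        if any(path.startswith(prefix)
               for path in d.touched_paths
               for prefix in affected_prefixes)
    ]


def in_scope_fraction(diff, affected_prefixes):
    """Placeholder for a trained ranker: share of touched files in scope."""
    if not diff.touched_paths:
        return 0.0
    in_scope = sum(
        1 for path in diff.touched_paths
        if any(path.startswith(prefix) for prefix in affected_prefixes)
    )
    return in_scope / len(diff.touched_paths)


def rank_candidates(diffs, score_fn, top_k=5):
    """Stage 2: sort survivors by score, highest first, and keep top_k."""
    return sorted(diffs, key=score_fn, reverse=True)[:top_k]


def shortlist(diffs, affected_prefixes, score_fn=None, top_k=5):
    """Run both stages and return at most top_k root cause candidates."""
    if score_fn is None:
        score_fn = lambda d: in_scope_fraction(d, affected_prefixes)
    survivors = heuristic_filter(diffs, affected_prefixes)
    return rank_candidates(survivors, score_fn, top_k)
```

Running the cheap filter first means the more expensive scoring step only sees diffs that could plausibly be related, which is the point of layering the stages.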

Risks: misleading responders

Conscious that sending responders on a wild-goose chase is a risk with this strategy, the Meta team ensures their AI system can explain why each root cause candidate was chosen. From there, responders can evaluate whether it’s a reasonable hypothesis before doing a full deep dive.

Additionally, they discard low-confidence candidates instead of recommending them to responders.
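
A minimal sketch of those two safeguards, assuming each candidate carries a confidence score between 0 and 1 and a short explanation. The `Candidate` type, the 0.5 threshold, and the rule that a candidate without an explanation is dropped are illustrative assumptions, not details from Meta.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    diff_id: str
    confidence: float  # assumed to be in [0, 1]
    explanation: str   # why the system flagged this diff


def recommend(candidates, min_confidence=0.5):
    """Return only explained, sufficiently confident candidates, best first."""
    kept = [
        c for c in candidates
        if c.confidence >= min_confidence and c.explanation.strip()
    ]
    return sorted(kept, key=lambda c: c.confidence, reverse=True)
```

Returning an empty list when nothing clears the bar is deliberate in this sketch: showing no recommendation is cheaper for a responder than chasing a weak one.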

Conclusion: passive vs active AI

For Google, AI has a supplementary role in their incident response plan. It saves responders time on tasks that a machine can be trained to do very well, so they can focus on the response actions that only humans can do well.

For Meta, on the other hand, AI is actively working to find answers so that responders have a narrower problem space to digest.

Both companies are exploring ways to streamline their automated incident response through avenues that are compatible with their culture and ecosystem.