May 30, 2024
11 mins
Earlier this month, an inadvertent misconfiguration in an internal tool used by Google Cloud resulted in the deletion of a user’s GCVE Private Cloud. The user in question? UniSuper Australia — a $125 billion Australian pension fund with over 600,000 users. In this post, Ashley reflects on the communications shared and what we can learn from them.
While the incident itself was isolated to just one of GCP’s thousands of customers, Google issued both a joint statement during the incident and a public postmortem after it was resolved.
In this post, we’ll reflect on the communications shared and what we can learn from them. Please note: I am not affiliated with Google Cloud or UniSuper in any way. The thoughts and opinions expressed in this article are just that — opinions based on my personal experience.
On May 8, while the outage was still underway, UniSuper and Google Cloud issued a joint statement:
A joint statement including Google Cloud’s CEO is certainly not a typical response to an isolated customer issue. The statement was brief, strategic, and salient. Let’s unpack some notable elements and possible factors behind this communication:
Attaching Thomas Kurian to this statement sends a strong message that this incident has reached the highest escalation point within the organization. As mentioned in the statement, a misconfiguration had been identified as the cause by that time. Personally, I think it’s highly unlikely that Google Cloud would have moved forward with a joint statement had they not yet confirmed they were at fault for the issue. It’s rare to see the CEO of a company of Google Cloud’s scale attached to an incident communication at all. More often, even for severe incidents, a CTO, CISO, or CIO is the senior executive named in a public acknowledgement. Your CEO is your most powerful spokesperson, so using them externally in an incident can magnify the perception of the issue by immediately drawing additional attention to it. And if done too frequently, putting your CEO out there every time something goes wrong can reduce the impact of their presence, or damage their public credibility if it isn’t balanced with positive coverage.
Because this statement was issued through UniSuper’s website, including Thomas Kurian added legitimacy to the statement (showing that UniSuper wasn't simply trying to pass the buck to their cloud provider for some murky technical problem).
Given the confirmed impact and causes associated with this incident, I agree with the move to attach both CEOs to the communications and, in doing so, to make it clear that this is not a typical incident with a typical response, as they’ve done here with the phrase “one-of-a-kind occurrence”.
Given the isolated impact, this incident wouldn’t typically be reflected on Google Cloud’s status page. So, from a user or external perspective, UniSuper appeared to be solely responsible for the problem. Outlets like the Sydney Morning Herald reported the outage as “a technical glitch” and even began speculating about the involvement of possible “hackers”. Australia’s Finance Sector Union reported that “UniSuper is in the midst of an ongoing crisis”.
The press wasn’t their only problem. During this incident, UniSuper’s 600,000+ customers were unable to access their pension funds, left to wonder if their money had been stolen or had mysteriously vanished. Undoubtedly, UniSuper was under an extreme amount of pressure to explain what was going on. It’s likely that they responded to this by putting the pressure on Google Cloud to acknowledge their role in the incident and alleviate some of the heat they were facing from their customers.
Had Google been resistant to sharing the burden of this communication, it’s possible that they could have leaned on contract terms preventing UniSuper from naming them directly. (To reiterate: I do not know the details of UniSuper’s agreement with Google as their cloud service provider. I am drawing from general experiences in which third-party service providers have restrictive policies around their customers naming them as being at fault for technical issues.) That said, if it were to come out later that they were at fault and hadn’t acknowledged it, they’d face even more backlash. Regardless, they did the right thing by owning their mistake publicly.
One of the headings within the statement was “Why did the outage last so long?” The choice to use a question that deliberately acknowledges the duration of the incident shows an awareness that the intention of this communication isn’t to inform people that something was wrong, but rather to acknowledge an issue the audience was already aware of. The communication is addressed directly to members of the pension fund, not to the general public. By speaking to those affected in a way that acknowledges their existing concerns, Google Cloud and UniSuper are able to shift the conversation from reactive (responding to inbound requests) to proactive (stating what is coming next). UniSuper was also able to demonstrate their care for reliability by highlighting their backup through an additional cloud provider.
The companies also wisely used this communication as an opportunity to emphasize the uniqueness of the issue and ease other Google Cloud customers’ concerns around the possibility of recurrence or impact beyond UniSuper.
I would have liked to see a more concrete assurance in the “What’s Next?” section around a follow-up postmortem communication. Making a promise within your control and following through is a powerful tool in rebuilding trust. For example, they could have stated “Upon concluding our root cause analysis, we will share additional details regarding what led to this incident and the steps we are taking to ensure this does not happen again.”
The main messages distilled in this communication were:
In my opinion, Google and UniSuper did well to stick to these salient points in order to keep the communication brief and effective, though it could have benefited from clarity on where and how additional communication, including a postmortem report, would be shared.
Issuing a joint statement, especially one with two CEOs named, certainly attracts an influx of media attention. For companies who find themselves in a similar situation, it’s key to anticipate this wave and plan for how the slew of inbound media inquiries will be handled.
Upon full resolution of the incident, Google Cloud also published a detailed postmortem on their blog:
https://cloud.google.com/blog/products/infrastructure/details-of-google-cloud-gcve-incident
Let’s run through it and unpack!
Google did well to include this statement close to the opening of their post:
“Google Cloud has taken steps to ensure this particular and isolated incident cannot happen again.”
Whenever you make a statement about “ensuring” a particular incident doesn’t happen again, it’s important to choose the language carefully to avoid overpromises or misinterpretations. Google did this here by specifying “this particular and isolated incident” rather than using a broader statement like “to ensure this never happens again”, which could be interpreted as referring to any incident leading to unexpected customer data loss.
Something I would have changed in the opener is the statement “The impact was very disappointing and we deeply regret the inconvenience caused to our customer.” While I agree with acknowledging the customer impact upfront, I find “inconvenience” often sounds canned and reductive, especially given the impact of this incident. My preference would be something along the lines of:
“We take our responsibility as the world’s most resilient cloud infrastructure provider extremely seriously, and we deeply regret the impact this incident caused our customer.”
In clarifying the scope of the incident, Google Cloud made sure to emphasize what was not impacted, preemptively dispelling any concerns from other customers who might (reasonably) wonder: “Could this happen to me?”
As a service provider, you walk a fine line between acknowledging impact to your services and absorbing more of the downstream impact on your customer’s customers than necessary. You’ll see that Google didn’t share any specifics about the number of impacted users for UniSuper or the nature of the data that was lost, despite these details being publicly available. They kept the scope focused on their system only.
Clearly defining the scope set the tone for the next section by making it clear that there was no impact to anyone outside the single customer.
They provided a concise “TLDR” summary of the events causing the incident before diving deeper into the intended purpose of the internal tool in question, and the reason the issue was able to go undetected.
Ultimately, the cause boils down to human error on the part of Google operators. Human error is one of the most difficult causes to communicate, and Google softened the delivery with a common strategy, using the passive voice to describe the events:
“During the initial deployment of a Google Cloud VMware Engine (GCVE) Private Cloud for the customer using an internal tool, there was an inadvertent misconfiguration of the GCVE service by Google operators due to leaving a parameter blank.”
While passive voice is often discouraged in professional writing, it can be useful when you want to bring less attention to the subject, as is often the case when admitting fault.
Google also used the “Diving Deeper” section to effectively anticipate and address likely follow-up questions, such as:
While it’s not always necessary (or advisable) to include additional context outside of the immediate facts and events of an incident, when the events raise obvious questions, including some additional information can help you stay ahead of the narrative and demonstrate competency. It’s a careful balance between including too much extraneous information, which can often just lead to even more confusion or questions, and demonstrating transparency and good faith.
Reading between the lines of the recovery statement, it sounds like this essentially boiled down to manual labor to rebuild the customer’s private cloud with the data that was (fortunately) still available from their backup. Despite the callout around “Why did the outage last so long?” in the joint statement, Google Cloud lauded the “rapid restoration” in the postmortem, with a nod to their customer’s “robust and resilient architectural approach to managing risk of outage or failure.” This is a subtle reframing that serves to adjust the public perception from “that went on a really long time” to “all things considered, it was resolved pretty fast”.
Ultimately, Google made a mistake and owned it. Personally, I would have considered providing a little more insight into how they worked with and supported their customer here. Something like “We remained in constant communication with our customer during the recovery process, providing regular updates on progress until resolution.” Assuming many Google Cloud customers will be reading this postmortem, this could provide some extra assurance around the level of support that can be expected in the event that a critical error on Google’s part impacts a customer.
The most important part of any postmortem communication is answering the question “How are you making sure this doesn’t happen again?”
Google hit the essential points by making it clear that they’ve deprecated the tool that caused the problem and manually reviewed all GCVE Private Clouds to ensure that this automatic deletion problem isn’t lurking in any other configurations.
I would have liked to see some clarification on any other paths that may exist for automatic deletion. Have they reviewed all paths to automatic deletion and ensured reasonable safeguards against human error? Of course, preventing the possibility of human error entirely is impossible, but I’m sure customers and Google operators would rest easier knowing it’s really hard to accidentally delete an entire instance.
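To make that concrete, here is a minimal Python sketch of the kind of safeguard I have in mind. Everything in it is hypothetical: the `ProvisionRequest` type, its `term_days` and `auto_delete_confirmed` fields, and the `validate` function are invented for illustration and say nothing about how Google’s internal tooling actually works. The idea is simply that a blank parameter is rejected outright rather than silently accepted, and that anything ending in automatic deletion needs an explicit opt-in.

```python
from dataclasses import dataclass
from typing import Optional


class UnsafeRequestError(ValueError):
    """Raised when a request could lead to an unintended deletion."""


@dataclass(frozen=True)
class ProvisionRequest:
    # All fields are hypothetical and exist only for this illustration.
    customer_id: str
    term_days: Optional[int] = None      # None models a parameter left blank
    auto_delete_confirmed: bool = False  # explicit opt-in to deletion at end of term


def validate(request: ProvisionRequest) -> ProvisionRequest:
    """Return the request unchanged, or raise if it is risky or incomplete."""
    if request.term_days is None:
        # Fail loudly instead of silently substituting a default term.
        raise UnsafeRequestError("term_days was left blank; set it explicitly")
    if request.term_days <= 0:
        raise UnsafeRequestError("term_days must be a positive number of days")
    if not request.auto_delete_confirmed:
        # A finite term implies deletion later, so require a deliberate opt-in.
        raise UnsafeRequestError("automatic deletion requires explicit confirmation")
    return request
```

With a check like this in the provisioning path, calling `validate(ProvisionRequest(customer_id="example"))` raises an error at deployment time, when a human is still watching, rather than letting the blank value surface later as a deletion.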
In the conclusions section, Google drew attention to the isolated nature of the incident, their existing safeguards, and the general reliability of their infrastructure. They also took the time to praise their customer’s contributions to the recovery effort.
This incident has many challenging factors for communicators — multiple organizations involved, big financial impact, human error, time-consuming manual remediation, etc. The communication strategy did an effective job of sounding authoritative and controlled, while acknowledging the human impact of the issue. Props to everyone who worked hard to rectify the problem (and thanks for providing a really interesting case study into incident communication!)