June 4, 2021
5 min read
Sometimes, as these four incidents highlight, a major failure results from a mere typo or configuration oversight.
Ask an SRE to name the most serious threats to reliability, and typos probably won’t feature on the list.
But maybe they should. Sometimes, an errant keystroke is all it takes to bring critical systems offline. And unlike text messages and email, these systems don’t usually have autocorrect tools to step in and save the day.
To prove just how important it is to double-check your spelling when configuring cloud services, DNS records and other systems that play a central role in delivering a smooth end-user experience, here’s a look at four incidents where failure resulted from mere typos or simple oversights.
S3, the Amazon cloud’s object storage service, is a critical part of the software stack of thousands of major websites. When it goes down -- as it did on February 28, 2017 -- those sites go down, too.
The cause of that outage, which lasted about four hours, turned out to be a mistyped command that an Amazon engineer entered while debugging some of the servers that support S3. The intent was to shut down just a few servers, but the typo took a much larger set offline. Restarting everything took several hours, hence the outage.
What can SREs learn from the S3 outage? One lesson, of course, is that it’s always important to double-check the arguments you are issuing to your commands. Most CLI tools will happily do whatever you tell them to without questioning it. Want to rm -rf your home directory? Don’t count on Bash to stop you.
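One way to make a tool question its input is to cap how much capacity a single command may remove. The sketch below is purely illustrative: the function name, the exception and the 5 percent default are our own assumptions, not a description of Amazon's tooling.

```python
class UnsafeRemovalError(ValueError):
    """Raised when one command would take too many servers offline."""


def check_removal(requested: int, fleet_size: int, max_percent: int = 5) -> int:
    """Return `requested` if it is within the per-command limit, else raise.

    `max_percent` is a hypothetical policy knob: the largest share of the
    fleet that a single command is allowed to remove.
    """
    if fleet_size <= 0:
        raise ValueError("fleet_size must be positive")
    if requested <= 0:
        raise ValueError("requested must be positive")

    # Integer arithmetic avoids float rounding at the boundary.
    # Always allow at least one server so small fleets stay manageable.
    limit = max(1, fleet_size * max_percent // 100)

    if requested > limit:
        raise UnsafeRemovalError(
            f"refusing to remove {requested} of {fleet_size} servers; "
            f"limit for one command is {limit}"
        )
    return requested
```

With a fleet of 200 servers and the default limit, a request to remove 3 servers passes, while a fat-fingered 30 is rejected before anything is shut down.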
But we all make mistakes -- even seasoned AWS engineers. That’s why the other takeaway here is that it’s critical to build resiliency and fallbacks into your systems. Even S3, which famously promises “11 9s” of durability, can become unavailable. In a perfect world, the websites that crashed when S3 went down would have been able to fail over to backup storage -- perhaps a local copy of S3 data, or a different S3 region that wasn’t impacted by the typo-inflicted outage.
Of course, backup storage strategies like these are complicated to implement, and they could significantly increase storage costs. If you want to mirror S3 across regions, you’ll basically pay at least double. So you can’t fault these websites too much for going down along with S3.
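For teams that decide the extra cost is worth it, the read path of such a fallback can be quite small. Here is a minimal sketch, assuming each storage location is wrapped in a function that takes a key and returns bytes; the names are hypothetical, and in practice the fetchers would wrap your storage client for the primary region, a second region or a local copy.

```python
from typing import Callable, List, Sequence


class AllStoresFailedError(RuntimeError):
    """Raised when every configured storage location failed."""


def read_with_fallback(key: str, stores: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each store in order and return the first successful read."""
    errors: List[Exception] = []
    for fetch in stores:
        try:
            return fetch(key)
        except Exception as exc:  # any failure means "try the next store"
            errors.append(exc)
    raise AllStoresFailedError(
        f"all {len(stores)} stores failed for {key!r}: {errors}"
    )
```

The order of the list expresses preference: put the primary first and the backups after it. This sketch only covers reads; keeping the backup copies in sync is the complicated and costly part.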
The Amazon S3 typo was hardly a one-off. Cloudflare, the anti-DDoS and cloud reliability company, experienced a similar issue in July 2020, when a “configuration error” -- which we can only assume was some kind of misentered data or missing field within routing rules or similar configurations -- caused much of the company’s global traffic to be routed through its Atlanta location.
The flood of traffic predictably overwhelmed the Atlanta router and degraded connections from other locations across the world that depended on it. A number of major websites were impacted, including Medium, GitLab, Shopify and Politico, among others.
To its credit, Cloudflare was able to restore normal service in less than half an hour. We imagine its engineering team had some pretty good monitoring rules and incident response playbooks in place to enable that level of rapid resolution.
Cloudflare was also very transparent about the issue. Its CTO was straightforward in explaining what had gone wrong and which steps the company had taken to prevent the recurrence of a similar issue. It was a case study in how to communicate openly with your users (and the public at large) when you make a mistake -- even one as simple as a configuration error.
SSL certificates can feel like a necessary evil. Keeping them properly configured and updated is a tedious task, but it’s also a critical one for ensuring a secure user experience.
And when you forget to update your certificates -- as Microsoft did for its Teams service in 2020 -- systems will go down. After a certificate expired, users could no longer authenticate with Teams. It took Microsoft a few hours to fix the issue.
The cause of downtime in this case wasn’t exactly a typo. But it’s an oversight that falls into the same bucket: A configuration error that is easy to make but that can have major consequences. As such, it’s yet another reminder of the importance of validating configuration data to catch simple mistakes that your engineers have overlooked -- in this case, a certificate that was about to expire.
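A check like this is easy to automate. The sketch below uses Python's standard ssl module to read a certificate's expiry date and flag it when it falls inside a warning window. The 30-day default and the function names are our own assumptions, and this is not a description of how Microsoft monitors its certificates.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter date (negative if expired).

    `not_after` uses the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def needs_renewal(not_after: str, warn_days: int = 30,
                  now: Optional[float] = None) -> bool:
    """True when the certificate expires within `warn_days` days."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to `host` and return its certificate's notAfter string.

    Note: with the default context, the handshake itself fails if the
    certificate has already expired, so this is for early warning only.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Run on a schedule, something like needs_renewal(fetch_not_after("example.com")) could raise an alert weeks before users notice anything.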
The millions of lines of code that power modern commercial aircraft can do some amazing things, like automatically calculate how to get a plane off the ground.
But these systems are only as good as the data that humans put into them. And when a typo leads to the wrong input, the results can be catastrophic.
That’s what almost happened in 2017 in San Francisco, where a pilot accidentally entered the number 10 into the system instead of 01 to indicate which runway the plane was going to depart from. As a result, the plane’s takeoff systems operated on the assumption that they had a much longer runway than they did.
In the event, the plane ended up taking off without issue. But it was a close call, and a reminder of how very simple data input mistakes can lead to very big problems.
There is also, we think, a lesson here about opportunities for smart data validation. It presumably would not be too difficult to build systems that allow planes to verify automatically -- using geo-sensors, for instance -- which runway they are actually taking off from, in order to double-check the data that humans input.
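To illustrate the idea, runway numbers are derived from the runway's magnetic heading in tens of degrees, so runway 01 points roughly 010 degrees and runway 10 roughly 100 degrees. A cross-check against a sensed heading could look like the sketch below. The sensed heading input and the 20-degree tolerance are assumptions for illustration only; real avionics are far more involved.

```python
def runway_heading(runway: str) -> int:
    """Nominal magnetic heading in degrees for a designator such as "01L"."""
    number = int(runway.rstrip("LCR"))  # drop the Left/Center/Right suffix
    if not 1 <= number <= 36:
        raise ValueError(f"invalid runway designator: {runway!r}")
    return number * 10 % 360  # runway 36 points to 360, i.e. 0 degrees


def heading_difference(a: float, b: float) -> float:
    """Smallest angle between two headings, in degrees (0 to 180)."""
    diff = abs(a - b) % 360
    return min(diff, 360 - diff)


def runway_matches_heading(runway: str, sensed_heading: float,
                           tolerance: float = 20.0) -> bool:
    """True when the entered runway is consistent with the sensed heading.

    `sensed_heading` is a hypothetical reading of the aircraft's heading
    while lined up on the runway.
    """
    return heading_difference(runway_heading(runway), sensed_heading) <= tolerance
```

For an aircraft lined up on a heading of about 14 degrees, an entry of "01L" passes the check, while the transposed "10L" is about 86 degrees off and would be flagged before takeoff.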
To make typos is human. We all do it, and we always will, despite steps that engineers may take to minimize the risk of data input mistakes.
That’s why it’s important to build resilience against configuration errors into systems whenever possible. In a perfect world, no software system or service would ever assume that data input is correct. It would take additional steps to validate the data, lest the error cause servers to go down, routers to crash or planes to fail to take off.