December 13, 2024
6 mins
Search and rescue (SAR) operations and incident response have striking similarities. In this series, Claire dives into lessons SREs can learn from wildfire management ICSs.
Claire Leverne is a mechanical engineer and outdoor-rescue expert with experience in automation, aircraft carrier systems, and drone building. An avid mountain biker and hiker, she’s also pursuing dual Master’s degrees in Environmental Management & Sustainability.
I always wanted to be an explorer. Take me back 300 years, give me a compass, and upgrade my petticoat to a pair of pants and voting rights, and I would’ve been unstoppable! But alas, I was born in the 90s, so I became an engineer instead. To keep my sanity through the long desk hours of my early career, I would pack every weekend with adventures: backpacking in the Olympics, rock climbing at the Needles, and mountaineering up Mt. Adams! These shenanigans led me in a roundabout way to discover the world of search and rescue operations (and ski patrol), which was the perfect recipe of technical skill, grit, endurance, ownership, and teamwork.
When I consider writing about incident response, the first thing that comes to mind is not my years of troubleshooting automation code or responding to equipment failures down at the plant; I think of getting a 3 AM call from our search and rescue duty officer, telling us to suit up to go find a lost hiker in the Pecos Wilderness. There are so many rich parallels between the world of search and rescue and tech incident management – from incident command structure to triage, real-time tracking to mission debrief. This post is the first of a series that will explore the practices and culture of SAR operations – and, more broadly, the outdoor community – and the lessons we can apply to our techie day jobs.
One of the cornerstones of effective search and rescue operations is the Incident Command System (ICS). Originally developed in the 1970s for wildfire management, ICS was instituted nationwide in 2004, in the aftermath of 9/11, and has since served as a standardized response protocol across the United States. It ensures that everyone involved in incident response, whether SAR volunteers, state troopers, or paramedics (or IT professionals!), operates within a defined structure. That structure keeps communication, organization, and adaptability intact throughout the mission, even as growing scale and complexity introduce new dynamics.
So, why should a tech team care about a system designed for rescuers in rugged terrain? Simply put, the principles of organization, communication, and adaptability are just as critical when you're racing against time to fix a system outage.
By its nature, an emergency leaves time in short supply, and given how humans respond under pressure, we are always one poorly worded sentence away from chaos. Foundational to any collaborative effort is a common language: shared terminology for organizational functions, resource descriptions, and incident facilities. This channels energy toward productive work, and it also plays an important role in defining the criticality of incidents (triage) so we can put out fires before dealing with the harder tasks of cleanup and redesign.
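For the techies in the room, here is a minimal sketch of what a shared triage vocabulary could look like in Python. The three-level `Severity` scale, the `Incident` fields, and the example incidents are all hypothetical; ICS does not prescribe them, and your team's labels will differ. The point is only that everyone sorts work using the same words.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale: a lower number means more urgent."""
    SEV1 = 1  # an active "fire", e.g. a customer-facing outage
    SEV2 = 2  # degraded service
    SEV3 = 3  # cleanup or redesign work that can wait


@dataclass(frozen=True)
class Incident:
    title: str
    severity: Severity


def triage(incidents):
    """Return incidents ordered most urgent first (ties keep their input order)."""
    return sorted(incidents, key=lambda incident: incident.severity)


# Illustrative incidents only; none of these come from a real system.
queue = triage([
    Incident("Refactor flaky deploy script", Severity.SEV3),
    Incident("Checkout API returning 500s", Severity.SEV1),
    Incident("Dashboard latency elevated", Severity.SEV2),
])
for incident in queue:
    print(incident.severity.name, incident.title)
```

Using an enum rather than free-text labels means nobody has to guess whether "urgent" outranks "critical" at 3 AM.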
When it comes to system complexity, modular organization is the key to scaling. In the world of SAR operations there are always bread and butter roles (rolls?) like the Incident Commander or Section Chief; but as layers of personnel get added to the mission, there is a plug-and-play hierarchy that expands to utilize every new addition. Every group of rescuers has a leader, and every group of leaders has a leader. This ability to grow and contract as needed not only keeps the response organized but also ensures that every aspect of the incident is addressed promptly and thoroughly.
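The "every group has a leader, and every group of leaders has a leader" idea maps neatly onto a tree. Below is a rough sketch, assuming a hypothetical `Unit` class of my own invention; the team names and people are made up, and this is not an official ICS data model. It shows groups being plugged in, removed, and traced back up the chain of command.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """A group in the response hierarchy; every unit has exactly one leader."""
    name: str
    leader: str
    members: list = field(default_factory=list)   # individual responders
    subunits: list = field(default_factory=list)  # nested Unit objects

    def add_subunit(self, unit):
        """Plug a new group in under this one as the incident grows."""
        self.subunits.append(unit)
        return unit

    def remove_subunit(self, name):
        """Contract the structure when a group is no longer needed."""
        self.subunits = [u for u in self.subunits if u.name != name]

    def headcount(self):
        """Count the leader, direct members, and everyone in nested units."""
        return 1 + len(self.members) + sum(u.headcount() for u in self.subunits)

    def chain_of_command(self, target, path=()):
        """Return the leaders from this unit down to the unit named `target`."""
        path = path + (self.leader,)
        if self.name == target:
            return list(path)
        for unit in self.subunits:
            found = unit.chain_of_command(target, path)
            if found:
                return found
        return None  # no unit with that name under this one


# Hypothetical structure for a tech incident.
command = Unit("Incident Command", leader="Incident Commander")
ops = command.add_subunit(Unit("Operations", leader="Section Chief"))
ops.add_subunit(Unit("Database Team", leader="DB Lead", members=["Ana", "Raj"]))
ops.add_subunit(Unit("Network Team", leader="Net Lead", members=["Mei"]))

print(command.chain_of_command("Database Team"))
print(command.headcount())        # everyone currently on the incident
ops.remove_subunit("Network Team")
print(command.headcount())        # after the network team stands down
```

Because each unit only knows about its own leader and its direct subunits, adding a fifth team looks exactly like adding the first.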
As rescuers arrive “on-scene,” they first report to Incident Command. There’s a formal briefing, and then it’s common for a map to be unfolded with detailed instructions for the path ahead; coordinates are assigned, and radio protocols and check-ins are established. Incident Action Plans (IAPs) are meant to be precise, as unity of effort is paramount to an operation’s effectiveness, efficiency, and safety. We’ll never arrive if we set out into the woods without knowing where we’re going, and it might not matter that we arrived if we arrive too late. Problem definition and goal identification are essential to individual and collective success.
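To make the plan-and-check-in idea concrete, here is a small sketch of how a tech team might track assignments and spot a team that has gone quiet. Everything in it is an assumption for illustration: the `Assignment` fields, the `overdue` helper, the team names, and the intervals are hypothetical and not part of any official plan format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Assignment:
    team: str
    objective: str             # the specific goal this team was sent out with
    check_in_every: timedelta  # the agreed "radio check-in" interval


def overdue(assignments, last_check_in, now):
    """Return teams whose latest check-in is older than their agreed interval."""
    late = []
    for assignment in assignments:
        seen = last_check_in.get(assignment.team)
        # A team that has never checked in is treated as overdue.
        if seen is None or now - seen > assignment.check_in_every:
            late.append(assignment.team)
    return late


# Hypothetical plan and check-in times.
plan = [
    Assignment("database", "Restore the primary replica", timedelta(minutes=15)),
    Assignment("network", "Confirm load balancer health", timedelta(minutes=30)),
]
now = datetime(2024, 1, 1, 3, 45, tzinfo=timezone.utc)
last_check_in = {
    "database": datetime(2024, 1, 1, 3, 20, tzinfo=timezone.utc),  # 25 minutes ago
    "network": datetime(2024, 1, 1, 3, 30, tzinfo=timezone.utc),   # 15 minutes ago
}
print(overdue(plan, last_check_in, now))  # ['database']
```

Writing the objective next to the check-in interval keeps the two questions that matter in one place: where is this team going, and when should we start worrying?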
Throughout every rescue effort, roles are assigned for the meticulous gathering, analyzing, and recording of information. This ensures that events are responded to in real time and at the appropriate level, and it also plays an important role in the mission debrief, lessons learned, and future action plan. Hindsight is 20/20. When you’re coordinating an evacuation and sweating on an exposed mountainside, catching a moment of reflection is about as likely as spotting a heffalump hurrying over to help. So many details, however bizarre, are missed in the moment, and those same details are critical to discussions around continuous improvement. What went well? What didn’t? How can we avoid fixing the same thing twice? Post-incident reviews help teams learn and grow, mitigate future errors, and improve consistency and response times.
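Since nobody remembers the details afterward, it helps to capture them as they happen. Here is a minimal sketch of an append-only incident timeline, assuming a hypothetical `IncidentLog` class; the class, its method names, and the sample entries are mine, for illustration only. Notes are timestamped when recorded and replayed in order for the review.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogEntry:
    timestamp: datetime
    author: str
    note: str


class IncidentLog:
    """Append-only timeline kept during the incident for the later review."""

    def __init__(self):
        self._entries = []

    def record(self, author, note, timestamp=None):
        """Store a note, stamped with the current UTC time unless one is given."""
        when = timestamp or datetime.now(timezone.utc)
        entry = LogEntry(when, author, note)
        self._entries.append(entry)
        return entry

    def timeline(self):
        """Return entries in chronological order, whatever order they were logged in."""
        return sorted(self._entries, key=lambda entry: entry.timestamp)

    def render(self):
        """Format the timeline as plain text for the post-incident review."""
        return "\n".join(
            f"{e.timestamp:%H:%M} [{e.author}] {e.note}" for e in self.timeline()
        )


# Hypothetical entries, deliberately logged out of order.
log = IncidentLog()
log.record("scribe", "Rolled back deploy",
           datetime(2024, 1, 1, 3, 20, tzinfo=timezone.utc))
log.record("scribe", "Pager fired for checkout errors",
           datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc))
print(log.render())
```

Sorting by timestamp at read time means a note jotted down late still lands in the right place when the team asks "what went well, and what didn't?"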
While saving lives isn’t typically synonymous with saving your company bunches of hours of frustrated, disorderly troubleshooting, lessons gleaned from SAR operations can empower engineers to respond to incidents with the same efficiency and effectiveness that SAR teams use to save lives. Bringing these practices into your team not only streamlines your incident management processes but also fosters a culture of preparedness, coordination, and continuous improvement.
Stay tuned for future installments, where we'll explore SAR topics around situational awareness, real-time tracking, triage, stress management, and how to avoid being the hiker who inspired the 3 AM call.
Make good choices, and remember to pack snacks!