Dan Slimmon is an incident management veteran who's worked at Etsy and HashiCorp and now leads consulting and training on pragmatic, non-bureaucratic incident response. Dan is also an Infrastructure Engineer at Clerk.com.
Incidents Are Normal in Complex Systems
Complex physical infrastructure like roads, power lines, and sewers appears to work almost all the time. But these systems are constantly maintained by humans and constantly subject to human action, and inevitably they break down. When they do, we have to figure out how to fix them without interrupting the motion of the city.
In a similar way, the distributed systems that we build are complex and thus subject to breakage. In complex systems, incidents are totally normal. That's not complacency; it's simply unavoidable: sooner or later, a failure mode will propagate through your system faster than you can get ahead of it and fix it, and you're going to have an incident. You need to prepare for when that happens.
The Power of Hypothesis-Driven Troubleshooting
In the scientific method, you test whether a hypothesis is true or false. Dan suggests applying the same methodology when troubleshooting an incident. To do that, though, you need to define hypotheses that are specific enough to be falsifiable: you have to be able to imagine a way to test them. If you can't test whether it's false, it's not a hypothesis.
It sounds like a simple concept, but producing quality hypotheses is not easy at first. Once you get the hang of it, though, and everybody is comfortable having those conversations with each other, it leads to much faster incident resolution than a complete lack of structure.
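To make the distinction concrete, here is a minimal sketch (not from Dan's talk; the names, thresholds, and metric value are illustrative) of what a falsifiable incident hypothesis looks like compared to a vague one:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch only: one way to capture a falsifiable hypothesis
# during an incident. "The database is slow" is not testable; "p99 query
# latency on the orders DB exceeded 500 ms during the incident window" is.

@dataclass
class Hypothesis:
    statement: str            # a specific, falsifiable claim
    test: Callable[[], bool]  # returns True if the claim survives the check

    def evaluate(self) -> str:
        return "not falsified" if self.test() else "falsified"

# In practice this number would come from your telemetry; it is made up here.
p99_latency_ms = 742

h = Hypothesis(
    statement="p99 latency on the orders DB exceeded 500 ms",
    test=lambda: p99_latency_ms > 500,
)
print(h.statement, "->", h.evaluate())
# -> p99 latency on the orders DB exceeded 500 ms -> not falsified
```

The point is not the code itself but the discipline it encodes: every hypothesis raised in the incident channel should come with an imaginable check that could prove it wrong.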
Leadership in Incident Response: Social Skills Matter More Than Technical Expertise
Incident commanders leading an incident response don't need to have all the answers. The key is to get the right people talking about the problem and to make sure everyone is on the same page. In this regard, having junior engineers on the response team can be beneficial: they will ask questions whose answers may be obvious to some senior engineers but most likely not to everyone in the room.
Incident response is largely about social and leadership skills rather than simply following a process or doing the technical work of fixing the problem. So there’s a lot to consider: How do I delegate? How do I get a bunch of people on the same page about what’s happening? How do I deal with somebody who’s gone down a rabbit hole about something that is not relevant?
"Nerd Sniping" as an Incident Response Tool
According to Dan, the power of counterfactuals is very real. He recalls asking for help on Linux forums: when he ran into an issue with X11, he'd say, "X11 sucks because this doesn't work this way, it works this other way." And then somebody would write him "six paragraphs of, like, 'You're wrong, and here are the different ways it works, and here are the links to the code.'"
Dan says that once he has identified such a problem, he delegates it through "nerd sniping," essentially "tricking somebody else into painting the fence for me."
Proactive Incident Readiness: Learning from Near Misses
Systems aren't like lawnmowers; they don't just work until they're broken. Dan recommends investing in regularly and collaboratively observing what the system is doing when it isn't broken. The best way he's seen to do that is to get groups of people, usually from different teams, together to look at telemetry: graphs, logs, and traces. Then they talk about what's weird about it.
This results in people finding things that are not yet outages but that, with a little imagination, could easily become one. That kind of thinking helps develop the idea that there isn't just a line between working and broken; there's a continuum of behaviors. Eventually, it's inevitable: one of these things is going to progress faster than we can catch it, so we'd better invest in noticing them early.
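A toy example (my own illustration, not from the talk) of what a "near miss" in telemetry can look like: disk usage that isn't an outage yet, but where a simple trend extrapolation shows it heading toward one. The function name and sample numbers are assumptions for the sketch.

```python
# Hypothetical sketch: extrapolate a metric's daily growth to estimate
# how long until it hits a hard limit. Nothing is "broken" yet, which is
# exactly the kind of signal a collaborative telemetry review surfaces.

def days_until_full(samples: list[float], limit: float = 100.0):
    """Estimate days until daily disk-usage percentages in `samples`
    reach `limit`, using the average daily growth rate. Returns None
    if usage is flat or shrinking (no projected breach)."""
    if len(samples) < 2:
        return None
    growth_per_day = (samples[-1] - samples[0]) / (len(samples) - 1)
    if growth_per_day <= 0:
        return None
    return (limit - samples[-1]) / growth_per_day

# 62% -> 70% over a week: working fine today, yet clearly on a path
# worth a conversation long before it pages anyone.
usage = [62, 63.5, 64.8, 66.2, 67.5, 68.9, 70]
print(days_until_full(usage))  # about 22.5 days at this growth rate
```

The extrapolation is deliberately crude; the value of the review isn't the math, it's getting multiple teams to look at the same graph and ask what's weird about it.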