Managing Chaos: Incident Response Edition

“When the page first goes off, I think maybe the most uncomfortable experience is if you're getting woken up three in the morning,” says Emmett Walsh, Site Reliability Engineer at Glitch, describing what it’s like to get paged for a technical incident in this week’s Shift Shift Forward. “I would say, my first response is just pure confusion. I don't know what's happening.”

“When you don't know what's going on and everybody's affected... that is fear-inducing and stress-inducing. Panic-inducing maybe is the right phrase, at least for me,” said Cori Schlegel, Senior Site Reliability Engineer.

Incident response covers the actions taken to investigate, fix, prevent, and communicate about a function that is not working as expected. While it is a team effort, Tasha Hewett, Senior Support Engineer at Glitch, explains that it’s more specifically like an orchestra. “Everybody has their own part to play,” Tasha said, “and while we all know what each other's role is, we might not necessarily know how they physically play their part. So we have to sit back and let everyone play their own instrument.”

As a Support Engineer, Tasha shares news about outages with the community. In her words, she feels like the gong player of the Glitch orchestra. “I don't play much, but when I do, it's very important and the audience is going to hear it,” Tasha elaborated. “So I have to sit and wait and wait. And when is it my turn? Okay, here's my turn. I'm going to do a status update, or I'm going to post on Twitter. And it has to be at the right time and it has to be accurate. So it can be a lot of pressure, but it's an important role.”

And when Glitch service is restored, the job still isn’t done. “We go through an incident retrospective. We first try as best we can to understand what happened… then we get together everyone involved or [those] who might have domain expertise [and] we'll discuss what happened,” Emmett said. “We'll try and pick out things that went well in how we dealt with the incident. Be that from how we noticed it to how we dealt with it in the moment to how we communicated about it.”

Outages and incidents are the best learning experience, adds Anil Dash, Glitch’s CEO. “You learn more about your systems when they're broke than when they're working just fine. I think that part is really fun and can be very rewarding. Finally, once you've gone through the whole incident and you figure out what's actually wrong with the systems and you fix it, then you're done. That can be extremely fulfilling.”

Check out Shift Shift Forward on iTunes or wherever you listen to your podcasts.