The extended outage

Ah, the extended outage. The dreaded dear-god-when-will-something-go-my-way-damnit outage. Every seasoned admin has had one, whether by accident or (faulty) design. This is the outage that impacts your customers for a long time, bringing pressure on you to “Just fix it”. As if it were, you know, like slapping a spare tire on the car or something. In this article I’ll proffer some advice to hopefully help your next extended outage hurt less.

The premise is simple. You have a system that is down, and has been down for some time. It may be due to planned maintenance gone awry (like a patch that breaks a core service). It may be a tricky, nasty bug that is well-nigh impossible to track down. It may be unplanned: the godawful hardware failure, hacker attack, or DDoS. Regardless of the reason, the users are screaming, and if it is not back up soon they will amass, torches in hand, at your gate. The temptation here is to duck your head down, take the blows, and just Make It Work. Doing anything else seems like a waste of time.

It isn’t.

The first thing you need to remember is communication. It is key in these situations. You need to keep your users and support staff updated in some way. Doing so will calm them, which is good for them and good for you (fewer torches = good). As a customer, which would you rather hear: “They’ve been working on it for hours, but I have no idea when it will come up” or “The admins updated us twenty minutes ago and they are installing a new part they hope will fix the issue”? Being informed helps them understand that you are, in fact, hard at work getting them back to good. The other side of communication in these situations is teamwork. Get a conference bridge going if you can, with managers, vendors, and other admins on the call. It should be a floating bridge where people can hop on or off at will, with the expectation that you will be around as much as possible.

The second part is delegation. Don’t be afraid to delegate: have someone go pick up the part or check the status of the backups while you do the things only you can do. There’s a tendency among admins to try to be self-reliant, which is a good trait unless it lengthens the outage and runs them ragged. LET PEOPLE HELP YOU. I once had an appeasement engineer, en route to my 2AM-11AM fileserver debacle, call me at 6AM to ask if I needed anything. I asked him to stop off and get me a triple-shot mocha and a bagel, to which I attribute my ability to make it to 11AM. It seems like a small thing (and since it was on EMC’s card for a service call to a $250k machine, it was small), but it helped tremendously. Just remember to be courteous and kind and people will be glad to help you.

The third part is mitigation. Sometimes a service can be migrated, or temporarily serviced by another host. If your primary DNS servers go down, tossing up a caching resolver and letting it take over their IPs for a few hours can help a lot, even if your authoritative domains are still down. Don’t waste a lot of time on band-aids, but if it is clear that you can mitigate the impact with a small amount of work (better: an amount of work you can delegate), do it. Stopgap measures, while they may seem half-assed, send the message that you are aware of how this affects your customers.

So, now the outage has gone on way longer than you wanted it to, and the pressure is out of control despite the measures you’ve taken above. Now it’s time to take a break. Grab a coffee or a water, walk around the building, grab a smoke, whatever. Five minutes of taking care of yourself and letting your brain relax can make a huge difference in your effectiveness, both mentally and socially. Make sure people know you are just taking a minute to regroup, and then do it.

None of these things fix the problem any faster, but all of them can decrease the impact of the issue and make everyone’s life a little easier in a troubled time. Hopefully they’ll help you in some horrible outage some day.
