3 Tips For Handling Mission Critical Outages (From a Dude Who’s Been Through Too Many)

For all you Mission Critical SaaS operators out there…this one is for you.

Most SaaS Businesses aren’t really that critical.  If they’re down for a day, it’s kinda annoying, but your customers aren’t going to leave or lose too much revenue. Some may blaze up your customer support queue, but that’s more a factor of their own temper than it is about your business most of the time.

But if you’re operating multiple mission-critical businesses, then incidents, outages, and crashes are urgent, critical business threats.  There is no “I’ll get to that in a minute” with these incidents. 

This is where there is no 9 to 5.

Fortunately, it’s been a while…

It’s been over two years since any of my mission-critical services had a true stability issue that knocked customers offline. 

Anyone that's been through those events, whether employee, contractor, or owner, knows that feeling of terror as you realize it's happening.

Yet, in July 2024, a QuotaGuard customer's dedicated traffic was briefly interrupted. We got alerts that both load-balanced proxies were unreachable within 15 minutes of each other. Very odd...

It didn't take long for us to restart everything, but it did result in a few minutes of downtime.

But what can I share to help others like me that find themselves operating mission critical services that will, inevitably, have issues?

My Three Tips for Other Critical SaaS Operators

It’s been a few weeks since the events, yet my blood pressure still shoots to the moon anytime I see a ticket come in, or an alert hit Slack, that might indicate further issues. 

There are three lessons that I can share from going through these situations a few times in the last few decades of IT fire-fighting with Appointment Reminder, QuotaGuard, PutsReq/Box, Gigalixir, and Extendware.

Calm Your Nerves First

I’m usually the point person on customer service with those affected during these outages. You likely are too.

At this point in my career, I'm not dealing with technical issues any more, so my best place in these emergencies is on the keyboard communicating with those affected.

It's important to me that they have instant communication of the status and where we are on the road to fixing the issue(s).

On the flip side, that also means I get an unvarnished, raw understanding of the negative effects our issue is causing to their company and their work lives. 

The most important first step is to shove thoughts of “is this going to kill my company?” out of my mind.

If you're worried about that, you can't help your customers.

Close your eyes, become emotionless, and emerge robotic. Solve the problem, update customers, repeat.

Many times, all I can say is “we’re working on it…” because telling customers the truth is a better approach long-term than the Silicon Valley Customer Service approach of ignoring tickets for a day, then replying with “what is the issue you’re seeing?” and treating your customers like they don’t know their own infrastructure.

It’s embarrassing to say "we're working on it", because when you’re trying to figure out the problem, there’s not much else to share. 

At the least, it’s slightly comforting to the customer that you're in communication, being open and honest, even if it's not exactly what they want to hear.

Focus On What Best Helps The Customer

As stated above, to me, the best way we can help the customer is through communication and updates.  It can be tough when you have dozens of interactions going on with customers, but once you’ve calmed your own nerves, this is the best thing you can do for your customers.

If you’re still down, admit that you’re still down.

If you’ve found the problem and are working on a resolution, share it.

If a solution didn’t work, tell them that “attempt one wasn’t successful, so we’re trying the next logical solution.” 

Empathize with your customers, because they are likely getting hit on the inbound customer service queue of their company asking what’s wrong.  They are stuck in the middle of something they can’t control or resolve, so remember their predicament when they’re communicating with you. 

Prepare in Real Time for the Next Incident

Keep a log of issues, sticking points, and “this could be better” issues that you find with your current active incident procedures. 

Every incident exposes weaknesses in your process. I don’t think it can ever be perfect.

Grab paper and jot things down during the incident. I usually don’t even have time to app-switch to type out notes, but I can jot and draw on paper very quickly. 

Once the incident is over, and you finally relax, set a To Do to review those items a few days later when you can get some perspective.  Implement what you need for the next time (because it’s technology, it will happen again).

Calm Yourself, Focus on the Customer Communication, & Prepare for the Next Incident

Those three points are really it.

I don't need to make a "7 tips to handle your next customer service issue" when really, it's just three that are top of mind during an incident.

I hope it'll be another 2+ years before the next one...but we still have to prepare like it'll be tomorrow.

Writing this out to share with the public is part of that post-incident process...I hope it helps someone out there that finds themselves in a similar position some day.