Hitting reset: How COVID made us more tolerant of downtime

Published on the 09/07/2020 | Written by Mike Hicks

In the era of cloud and Covid, is downtime losing its stigma?…

These days when a public-facing service or app goes down, you’re just as likely to see understanding and solidarity as you are complaints.

Take the recent outage of enterprise collaboration tool Slack. While it occurred in the evening US time, Australian and Asia Pacific users awoke to an inability to collaborate, at a time when many more professionals than usual are working remotely and reliant on Slack and similar tools.

People – particularly those in technology professions, which likely make up much of Slack’s user base – immediately started sending their support to Slack’s engineers using the hashtag ‘HugOps’.

HugOps has been around for a few years, and is ‘a means of acknowledging the real humans working 24/7 to keep the services we rely on running as naturally as water’, Atlassian explains.

“It exalts empathy, cross-team collaboration and trust as the keys to finding problems and shipping solutions faster; and it is a feel good Twitter movement that is much welcomed on a platform that sees its fair share of negativity and trolling.”

HugOps’ spread shows IT is reclaiming the outage as a side-effect of using technology. While we – as IT – still want to know as soon as something breaks, we can ultimately empathise with teams trying to resurrect those services.

Traditionally, that patience has not extended to end users.

But, as we saw during the height of Covid-related panic and the onset of lockdowns, many e-commerce websites and corporate VPNs didn’t necessarily cope with a sudden influx of traffic.

During Covid, there has been a general theme of cutting all sorts of operations extra slack. It’s a recognition that the situation faced by many organisations is unprecedented and therefore that, while downtime is still inconvenient, it is at least understandable, and patience is warranted.

‘Acceptable’ downtime
It’s worth pointing out that more relaxed attitudes to downtime have been at least 20 years in the making.

Google ‘acceptable downtime’ and it returns over eight million results. One of the results on the first page is from 2002, seeking responses to what is an ‘acceptable downtime percentage’.

Acceptable downtime is usually expressed in an ‘x nines’ availability format. Five nines, or 99.999 percent, for example, provides for about five minutes of downtime a year. Targets for consumer and/or free services are usually more forgiving.
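To make the arithmetic behind the ‘nines’ concrete, here is a minimal sketch of how a yearly downtime allowance falls out of an availability target. The function name `allowed_downtime_minutes` and the 365.25-day year are assumptions made for illustration, not part of any standard or vendor definition.

```python
# Assumption: an average year of 365.25 days (525,960 minutes).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(nines: int) -> float:
    """Return the yearly downtime allowance, in minutes, for an 'x nines' target.

    Hypothetical helper for illustration only.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    # Five nines means 99.999% available, i.e. unavailable 0.001% of the time.
    unavailable_fraction = 10 ** -nines
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for n in range(2, 6):
        availability = 100 * (1 - 10 ** -n)
        minutes = allowed_downtime_minutes(n)
        print(f"{n} nines ({availability:.3f}%): {minutes:,.1f} minutes a year")
```

On these assumptions, five nines works out to a little over five minutes a year, while three nines allows closer to nine hours, which shows how quickly the allowance grows with each nine dropped.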

But this is all an expression of what a service provider or IT deems acceptable. The discussion now is to what extent users are willing to lower their own expectations.

We can tolerate outages, but still want to know about them
So the broader question is: where to from here?

On the evidence before us, many business leaders believe that the ways of working, the practices and the tools taken up during Covid will stick and become business-as-usual as restrictions are lifted.

That means the challenges posed by a geographically spread workforce with variable home-based internet connections and hardware are likely to become permanent. So, too, a large number of cloud-based services that have enabled people to stay reasonably productive in spite of the circumstances.

Delivering an excellent end-user experience in the digital domain requires a seamless orchestration between multiple third parties and your own systems – all of which must transit across the internet and, in some cases, private networks.

This makes proactive monitoring and visibility even more important. While we may ultimately become more accepting of occasional downtime, given the complexity of the environments that underpin our new ways of working, we will still want to know where and why an outage is occurring.

Businesses that take a proactive approach to visualising all the components that matter will be able to more effectively manage outage-related risks, create effective outage recovery plans and measure resilience when those plans are called into play.

Mike Hicks is principal solutions analyst for ThousandEyes.
