Why 100% uptime is a myth
100% Uptime: We have all heard of it and we all (rightly) strive to achieve it. The services we provide are key to the success of our businesses and therefore we want them to be constantly available.
Just imagine a potential customer attempting to visit your online presence only to be confronted with a server error. I’m sure you can envisage the instant bad impression this generates (“this company is unreliable and unprofessional”; “they can’t even keep their website online – why should I do business with them?”; etc.).
Some service providers claim to be able to provide 100% uptime but, unfortunately, it is a myth. Even when you remove planned maintenance windows from the equation, there are too many variables – too many things that could go wrong – for anyone to honestly promise this level of service.
Consider RBS/NatWest, RIM/BlackBerry and O2: they are all large, high-profile organisations, several of them multinationals. They have something else in common too: they have all recently experienced serious system outages affecting millions of customers. They all spend millions, if not billions, on infrastructure, yet they still have problems from time to time. Even NASA is not free from glitches, which goes to show that no matter what policies, procedures and checks are in place, there is always something that can go wrong.
Five reasons why it is impossible to achieve
- We live in a complicated world with complex interwoven systems. It is virtually impossible to be able to cover all single points of failure (even the ones that we have influence or control over).
- The law of diminishing returns: The cost of increasing the level of redundancy in systems rises steeply, while each extra layer buys a smaller improvement in uptime. It becomes less and less cost-effective as you add more hardware, more space to house it and more complex software to manage it. Eventually you reach a point where it stops being financially viable.
- As systems grow more complex (for example, through added redundancy), it becomes harder to diagnose problems when they arise.
- No software is absolutely bug-free.
- Human error: This is often overlooked. We all make mistakes, and even the most reliable and proven system can become a victim of human error.
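The first two reasons can be illustrated with some simple availability arithmetic. The sketch below is not a model of any real system: it assumes that failures are independent (which real outages often are not), and the function names and figures are hypothetical, chosen only for illustration. It shows that a chain of services is less available than its weakest link, and that each extra redundant copy adds less uptime than the one before without ever reaching 100%.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def series_availability(availabilities):
    """Availability of a chain in which every link must be up at once."""
    return prod(availabilities)


def parallel_availability(availability, copies):
    """Availability of `copies` redundant instances where one is enough.

    Assumes failures are independent, which is optimistic in practice.
    """
    return 1 - (1 - availability) ** copies


def downtime_minutes_per_year(availability):
    """Expected downtime per year for a given availability (0.0 to 1.0)."""
    return (1 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical chain: hosting, DNS and a payment provider at 99.9% each.
    chain = series_availability([0.999, 0.999, 0.999])
    print(f"chain of three 99.9% services: {chain:.4%}")

    # Hypothetical 99% server, duplicated one to four times.
    for copies in range(1, 5):
        a = parallel_availability(0.99, copies)
        minutes = downtime_minutes_per_year(a)
        print(f"{copies} copies: {a:.8f} ({minutes:.2f} min/year)")
```

Under these assumptions the second copy removes most of the expected downtime, while the third and fourth remove progressively less, even though each is likely to cost at least as much as the first. Bugs and human error, which tend to affect every copy at once, are not captured here and make the real picture worse.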
But my business needs 100% uptime
Rather than focussing all efforts on chasing the elusive 100% uptime, it is my belief that you should develop recovery and contingency plans to follow if one or more of the services you rely on have an outage. Of course, strive to achieve the best uptime that you can, but have well-communicated plans in place for when one of the links in the chain fails.
What can I do?
- Agree penalty clauses with your service providers for any downtime incurred – e.g. credits applied to your account.
- Introduce a backup policy with a quick and predictable turn-around time for restores. For example, in the event of disk failure, how quickly will you be able to have your backups back on-site and restored?
- Store contact details for your customers in an additional, alternative system so that your business can continue to function if, for example, your email service or phone system is unavailable.
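To make the first two points concrete, it helps to put numbers on them before an outage happens. The following is a rough sketch only: the function names, the 30-day month and the example figures are assumptions made for illustration, and a real restore will also include time for verification and reconfiguration.

```python
def allowed_downtime_minutes(sla_percent, days=30):
    """Downtime a provider can incur over `days` and still meet its SLA."""
    return (1 - sla_percent / 100) * days * 24 * 60


def estimated_restore_hours(data_gb, throughput_mb_per_s, retrieval_hours=0.0):
    """Rough restore time: getting the backup on-site plus copying it back.

    Uses 1 GB = 1000 MB and assumes the copy runs at a steady rate.
    """
    copy_seconds = data_gb * 1000 / throughput_mb_per_s
    return retrieval_hours + copy_seconds / 3600


if __name__ == "__main__":
    # Hypothetical contract: 99.9% uptime measured over a 30-day month.
    budget = allowed_downtime_minutes(99.9)
    print(f"99.9% SLA allows about {budget:.1f} minutes of downtime a month")

    # Hypothetical restore: 500 GB at 100 MB/s, with 2 hours to fetch media.
    hours = estimated_restore_hours(500, 100, retrieval_hours=2)
    print(f"estimated restore time: {hours:.1f} hours")
```

Comparing the restore estimate with how long your business can realistically operate without the affected system shows whether your backup policy is quick enough, and the downtime budget gives you a baseline against which to negotiate penalty clauses.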