Cloudburst: Why good clouds go down

Today, business lives and dies by hyperscale cloud platforms such as Amazon Web Services (AWS), Microsoft Azure and Google Cloud. The vendors know that, and they do their best to keep their clouds up and running. But despite their extensive global network infrastructure, resilient designs and sophisticated systems, these cloud services are not immune to downtime.

But you already knew that. As Werner Vogels, Amazon’s CTO, once said, “Everything fails, all the time.” So it's no surprise we've all had our clouds go down from time to time. And sometimes, as with the recent Google Cloud outage in Paris, the failure lingers. Google's europe-west9-a zone is heading toward its fourth week of degraded service.

What you may not know is the most common ways clouds fail. Here, based on my analysis and far too much time spent tracking down failures, are the ways clouds crash.

1.     Hardware Failure

As the joke goes, a cloud is just someone else's computers. That's way too simplistic, but at the end of the day hyperscale clouds still run on physical servers, storage devices and networking equipment distributed globally. And, as with any physical device, these components can and will fail.

While redundancy and failover protocols are commonly employed to mitigate these risks, no cloud is completely immune to hardware failures. For example, back in 2011, an AWS EC2 outage started with an overloaded network router. One thing led to another, and a storage re-mirroring storm caused massive control-plane problems; it took days for everything to get back to normal. Even then, some data was permanently lost.
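
To make that "redundancy and failover" idea a bit more concrete, here is a minimal sketch, in Python, of the kind of client-side failover logic teams often layer on top of a cloud's own redundancy. The region endpoints and health-check URLs are hypothetical placeholders, not any provider's actual API.

```python
import urllib.request
import urllib.error

# Hypothetical endpoints in two different regions; a real deployment
# would pull these from configuration rather than hard-coding them.
ENDPOINTS = [
    "https://api.us-east-1.example.com/health",
    "https://api.us-west-2.example.com/health",
]

def first_healthy_endpoint(endpoints, timeout=2):
    """Return the first endpoint that answers its health check, or None."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, TimeoutError):
            # This region is unreachable or unhealthy; fail over to the next.
            continue
    return None

if __name__ == "__main__":
    endpoint = first_healthy_endpoint(ENDPOINTS)
    if endpoint:
        print(f"Routing traffic to {endpoint}")
    else:
        print("No healthy region found; time to raise the alarm.")
```

The point of the sketch: even when the provider's hardware redundancy works as designed, the applications that sit on top still need their own plan for routing around a failed region.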

2.     Software Bugs

Software bugs are an inevitable part of any complex software system. Hyperscale cloud platforms, with their millions of lines of code and constant updates and fixes, are just asking for failures. Usually, cloud providers mitigate this with deployment strategies such as rolling updates, blue/green deployments and recreate deployments.
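
Those strategies are easier to picture with an example. The following Python is a toy illustration of a rolling update, assuming hypothetical deploy() and health_check() helpers supplied by the caller rather than any real provider's tooling: instances are updated in small batches, and the rollout halts if a batch fails its health check, so most capacity keeps running the known-good version.

```python
import time

def rolling_update(instances, new_version, deploy, health_check, batch_size=2):
    """Update instances a few at a time, stopping if a batch looks unhealthy.

    `deploy` and `health_check` are hypothetical callables; a real rollout
    would go through an orchestrator's APIs instead.
    """
    for i in range(0, len(instances), batch_size):
        batch = instances[i:i + batch_size]
        for instance in batch:
            deploy(instance, new_version)   # push the new software to one host
        time.sleep(5)                       # give the batch a moment to warm up
        if not all(health_check(inst) for inst in batch):
            # Something is wrong: stop here rather than spreading the bad
            # version across the rest of the fleet.
            raise RuntimeError(f"Rollout halted at batch {batch}")
    return "rollout complete"
```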

But, even with all that, in every cloud, a few software failures will occur. For instance, Google Cloud experienced a "catastrophic failure" in 2019. The cause? A configuration change intended for a few servers in a single region was mistakenly applied to numerous servers across several neighboring regions. The result? A huge mess.

3.     Human Error

As the old adage goes, "To err is human." The newer adage adds, "To err is human, but to really foul things up, you need a computer." I'd add that if you want the whole world to see your error, make your mistake on a cloud. Configuration mishaps, incorrect settings and inadvertent command executions have all caused system-wide issues.

A notable example came in 2017, when a single mistyped command during a routine debugging session caused a five-hour AWS outage. The command inadvertently took far more servers offline than intended, affecting high-profile services such as Slack and Trello, as well as parts of AWS's own service dashboard.

4.     Network Issues

Given the distributed nature of clouds, they rely heavily on network connections, both internally (between servers and data centers) and externally (to end users). Network failures can be triggered by any number of causes, such as ISP disruptions, DDoS attacks or even simple routing mistakes.

And, lest we forget, there's always the good old canonical backhoe taking out a fiber-optic line. According to the Common Ground Alliance, the organization that tries to prevent such incidents, in 2021 there were almost 50,000 episodes of telecommunication lines being busted by people digging where they shouldn't. Most of these don't amount to much, but some do.

It doesn't have to be a backhoe, of course. In 2015, a car crashed into the above-ground fiber-optic lines feeding one of OVHcloud's major datacenters. The result? Its available bandwidth was squeezed from 600 Gbps to 20 Gbps. Ouch!

5.     Natural Disasters

Last but not least, there are natural disasters. They may be rare, but they can cause significant, long-term service disruptions. While some disasters, such as 2012's Hurricane Sandy, which took out numerous Manhattan datacenters, make headlines, others are quite small… except for the effects they have on the cloud.

For example, Microsoft Azure suffered a severe outage in 2018 due to a lightning strike. The strike caused a voltage spike that led to a significant cooling failure, and the data center's infrastructure had to be shut down to prevent overheating damage. That, in turn, caused a ripple effect across Azure services globally.

The moral of our story? There's plenty you can do to soften the blow of cloud failures, but you'll never escape dealing with them entirely. Outages will always be with us.

