It was 10:03 am, EST. My wife called: all their applications at work were down. They were worried, as her company had completed its cloud migration to ‘DRT Cloud’ a few weeks earlier. No feedback from the cloud provider’s support, though.
I switched on my TV and lost my breath: “BREAKING NEWS - CLOUD PROVIDER DRT CLOUD WENT OUT OF SERVICE IN NORTH & SOUTH AMERICA AND EUROPE. RECOVERY TIME IS UNPREDICTABLE, COMPANY SAYS.”
How could that be possible? After years and billions invested in infrastructure worldwide!
The day after, DRT Cloud clarified that its network links had gone out of service due to a major undersea earthquake. Service might remain unavailable for weeks…
An hour later, stock exchanges worldwide were falling.
‘DRT Cloud’ is a fictional name I decided to use, maybe being a little superstitious…
Could the above scenario happen in the real world?
It’s rather unlikely.
Cloud providers have invested in reliable infrastructure and redundant network connections worldwide.
There are threats, but the global scale is itself the best mitigating factor.
HOW MAJOR CLOUD PROVIDERS BUILT GLOBAL INFRASTRUCTURES AND NETWORKS
AWS, Google, and Microsoft have each built a global infrastructure, with regions deployed worldwide and connected by resilient, high-performance networks.
Region redundancy isn’t always ensured at the continent level, though, meaning resilience to a regional disaster can be challenging.
Let’s take a closer look.
The AWS global infrastructure is made up of Regions, designed to be isolated from each other and deployed worldwide.
All Regions have multiple Availability Zones, which ensures isolation within a Region.
The AWS global network is a fully redundant 100 GbE fiber network backbone, circling the world via trans-oceanic cables and often providing many terabits of capacity between Regions.
AWS builds its own chips, servers, routers, storage, and load-balancers, to ensure the fastest path of innovation, standardization, and the highest levels of reliability.
I recommend to visit the .
Google Cloud’s infrastructure is deployed worldwide.
Google is the largest .
I recommend exploring the .
Microsoft Azure’s infrastructure is deployed worldwide.
Microsoft’s global network is one of the largest networks in the world.
Also, Microsoft is strongly supporting .
IMPRESSIVE, RIGHT? BUT ARE CLOUD PROVIDERS’ SERVICES IMMUNE TO OUTAGES?
Well, not at all.
Major outages happened, taking down one or more regions for a significant time.
Let’s look at some of the major ones in recent times:
Wednesday, April 8, 2020: caused severe service outage for about an hour
March 24-26, 2020: affected many customers in Europe and the United Kingdom
Sunday, January 23, 2020:
Monday, November 11, 2019: impacted many services in two US regions and one region in South America for more than 2 hours
August 31, 2019: a power outage hit an AWS US-EAST-1 data center in Northern Virginia, and the data center’s backup generators then failed. As a result, 7.5% of the EBS volumes and EC2 instances became unavailable. After power was restored, Amazon determined that some of them had incurred hardware damage with loss of data. Some customers faced extensive data loss, raising questions about the security of data stored in the cloud
Sunday, June 2, 2019: a Google Cloud outage affected YouTube, Gmail, and Google Cloud customers such as Snapchat and Vimeo
September 2018, November 2018, and May 2019: a data center outage in the South Central US region in September 2018, Azure Active Directory (Azure AD) Multi-Factor Authentication (MFA) challenges in November 2018, and DNS maintenance issues in May 2019
Friday, March 2, 2018: a (Ashburn), affecting hundreds of critical enterprise services like Atlassian, Slack and Twilio. Significant corporate websites and Amazon’s own service offerings were impacted as well
Tuesday, February 28, 2017: Amazon S3 saw a complete outage from 9:40 am to 12:36 pm PST. Many other AWS services that depend on S3 (Elastic Load Balancers, the Redshift data warehouse, Relational Database Service, and others) also had limited to no functionality.
- outages happen
- impacts can be bigger than expected since a growing ecosystem of service partners moved their services to the cloud
- communication and transparency might be a challenge during outages.
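To put these durations in perspective, a back-of-the-envelope calculation can help. The sketch below is illustrative only: it converts the February 28, 2017 outage window quoted above (9:40 am to 12:36 pm) into a yearly availability figure, then shows how availability would combine across redundant deployments. The function names are made up for this example, and the redundancy formula assumes failures are fully independent, which real-world outages often are not.

```python
from datetime import datetime

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def outage_minutes(start: str, end: str) -> float:
    """Length of a same-day outage window given as 'HH:MM' strings."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


def yearly_availability(downtime_minutes: float) -> float:
    """Fraction of the year a service was up, given total downtime."""
    return 1 - downtime_minutes / MINUTES_PER_YEAR


def combined_availability(single: float, copies: int) -> float:
    """Availability of N redundant copies, ASSUMING independent failures."""
    return 1 - (1 - single) ** copies


# Outage window taken from the list above (9:40 am to 12:36 pm).
s3_outage = outage_minutes("09:40", "12:36")
single = yearly_availability(s3_outage)

print(f"Outage length: {s3_outage:.0f} minutes")
print(f"Yearly availability with that single outage: {single:.4%}")
print(f"Two independent copies: {combined_availability(single, 2):.8%}")
```

The point of the second figure is not the exact number but the shape of the argument: spreading a workload across regions or zones only pays off to the extent that their failures are truly unrelated.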
Here are the links to real-time status pages for AWS, Google Cloud, and Microsoft Azure.
CLOUD REPATRIATION: ARE COMPANIES MOVING AWAY FROM THE CLOUD?
Cloud repatriation is happening, meaning “the shift of workloads from public cloud to local infrastructure environments, typically either a private or hybrid cloud environment”.
This is happening for several reasons, such as:
- poor initial planning, which led to unexpected issues or outcomes below expectations;
- unexpected costs (storage, for instance) or compliance issues, often related to data;
- security issues or worries;
- insufficient operational readiness.
Repatriation can also be a “move forward”, though.
For example, leveraging “cloud at customer” solutions, such as , or , can ease compliance constraints while leveraging a full cloud model.
CONCLUSIONS: CLOUD PROVIDERS BUILT GLOBAL INFRASTRUCTURE AND SERVICES, BUT MAJOR OUTAGES HAPPEN, COMPANIES ARE MOVING BACK TO ON-PREM…SHOULD YOU REALLY START YOUR CLOUD JOURNEY? SURE! RIGHT NOW!
My point is that embracing the cloud brings huge benefits, but comes with challenges and risks which need to be properly understood and mitigated.
Cloud providers give your IT organization access to global scale, consumption-based service models, built-in security and compliance, continuous innovation, and growing service options at decreasing costs.
Thus, time to market, modernization, and transformation programs can be significantly accelerated, but cloud adoption has to be matched with a strong program to build enabling capabilities, at infrastructure, architectural, and application development levels.
There are trade-offs: complexity, cost control, performance, operational readiness, lock-in, and skills, just to mention a few.
You can prepare and experiment, starting a wise journey that brings immediate value while mitigating risks and building up knowledge along the way.
Here are my recommendations, based on my experience and what I’ve learned so far:
- select one or two cloud providers first, specializing their usage (IaaS, DB PaaS, ML for instance); cloud selection might be driven by existing contractual agreements (such as license agreements), to save costs and speed up adoption and learning
- understand cloud service models, through low-risk POCs and training
- define basic use cases, starting low and easy: labs, IT platforms, non-production environments, lift and shift of front ends for low-criticality applications, adoption of cross-cutting services such as IAM
- plan and design resilience and performance, for infrastructure (regions, zones, network) and up to architecture and applications layers; don’t forget proper management
- ensure security solutions (vulnerability and security events management at least) and related processes (who is going to manage security events and incident response)
- adapt and ensure compliance with your company’s business continuity and disaster recovery scenarios
- ensure data compliance and resilience (back-up in different locations, maybe on-prem)
- build basic cost control processes (spending per service, alerts, optimization opportunities)
- invest in capabilities to automate and foster cloud opportunities, such as , and
- define your exit strategy, enabled by the above elements, and required by many regulations worldwide.
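As an illustration of the cost control recommendation above (spending per service, alerts), here is a minimal sketch. It is not tied to any provider’s billing API: the `BillingRecord` type, the service labels, the budget figures, and the 80% warning threshold are all assumptions made up for this example. In practice the records would come from your provider’s billing export.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class BillingRecord:
    service: str  # hypothetical label, e.g. "compute" or "storage"
    cost: float   # cost of one billing line item, in your currency


def spend_per_service(records):
    """Sum the cost of all line items, grouped by service."""
    totals = defaultdict(float)
    for record in records:
        totals[record.service] += record.cost
    return dict(totals)


def budget_alerts(totals, budgets, warn_ratio=0.8):
    """Return (service, status) pairs for services that need attention."""
    alerts = []
    for service, spent in sorted(totals.items()):
        budget = budgets.get(service)
        if budget is None:
            alerts.append((service, "no budget defined"))
        elif spent > budget:
            alerts.append((service, "over budget"))
        elif spent >= warn_ratio * budget:
            alerts.append((service, "approaching budget"))
    return alerts


# Made-up sample data for the sketch.
records = [
    BillingRecord("compute", 420.0),
    BillingRecord("storage", 310.0),
    BillingRecord("compute", 130.0),
    BillingRecord("ml", 45.0),
]
budgets = {"compute": 500.0, "storage": 400.0}

totals = spend_per_service(records)
alerts = budget_alerts(totals, budgets)
for service, status in alerts:
    print(f"{service}: {status} (spent {totals[service]:.2f})")
```

Flagging services that have no budget at all is a deliberate choice here: unplanned services are a common source of the unexpected costs mentioned earlier.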
While starting your “wise journey”, you can flesh out your IT cloud strategy, for instance by setting the path for migrating relevant workloads or building cloud-native ones. All the steps above will help you get a clearer picture of how to proceed further.
“I am not afraid of storms, for I am learning how to sail my ship.”
Louisa May Alcott, Little Women