TransAnalytics

My advice on how to become resilient in the Cloud

The London finance centre exposed.

Mastering Resilience in the Cloud

This continues from chapter one, the Digital Tsunami, which I posted just before the Azure outage (https://transanalytics.co.uk/the-digital-tsunami-a-fictional-wake-up-call-for-cloud-resiliency/).
This post takes you through my model for mastering Cloud resiliency.

I am the first to say that, depending on your organisation’s Cloud footprint and how you use the Cloud, both the challenge and the investment can be significant. However, I suggest you position the current single-Cloud architecture pattern (my assumption) as IT debt.

If the organisation does not see the importance of being resilient in IT, I suggest you use questions to get people on board. On the topic of questions, I recommend you read this article in HBR, which provides excellent tips: https://hbr.org/2024/05/the-art-of-asking-smarter-questions

Examples of such questions: “What was the cost of the recent Cloud outage?”, “What would happen if the outage lasted more than 2, 3, or 4 weeks?” and “Can we assume that DORA will also cover Cloud resiliency?”

My advice

First, we must consider this a general challenge with the Cloud. Before the Cloud, few, if any, would consider placing large business systems in a single data centre without the possibility of switching to a disaster recovery site. Today, it has become customary to assume that incidents like the one that just happened never happen. It reminds me of the Three Mile Island nuclear power station: before its partial meltdown, that too was something that could never happen (https://en.wikipedia.org/wiki/Three_Mile_Island_accident).

DORA has been invited in

After the recent Azure outage, Aunt DORA, the regulation (excuse the pun) that was once on the sidelines, has now been brought into the conversation. It wouldn’t be surprising if the scope of the regulation is soon expanded to cover resiliency in the Cloud more precisely. As the financial services regulators (FSA) are working to be proactive, you should start studying and implementing what is required to comply with the regulation. The formal version of DORA is at https://eur-lex.europa.eu/eli/reg/2022/2554/oj and a more easily digestible summary is at https://kpmg.com/xx/en/home/insights/2023/10/digital-operational-resilience-act.html

Reduce your risk exposure to a single Cloud provider

If you are using the Cloud and your business can handle an outage of several days or weeks, this post is less relevant to you. However, if you, like me, believe that most businesses today can be severely damaged by an extended outage, perhaps you should continue to read. If you use a single Cloud provider, you are highly exposed.

You may want to start taking action now to reduce the risk of an extensive outage.
To succeed with this work, it is essential to focus on principles. Principles help people better understand the reasoning behind the strategy and significantly increase organisational alignment.

Core principles for a significantly increased level of resiliency in Cloud

CLOUD VENDOR-AGNOSTIC TECHNOLOGIES MUST BE USED

NO CLOUD VENDOR-SPECIFIC FUNCTIONALITY OR DRIVERS CAN BE USED

PROVEN END-TO-END TO RUN ON BOTH CLOUD VENDORS

SERVERS’ LOCATION MUST BE DNS ALIASES

At the time of a change, you do not want to start changing hundreds or thousands of network addresses manually; hence, no configuration, solution, or application can refer to an IP address.
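One way to police this principle is to scan configuration files for hard-coded addresses before they are deployed. The following is a minimal sketch, not a finished tool: the function name `find_ip_literals`, the sample host names, and the choice to check IPv4 only are all assumptions made for illustration.

```python
import ipaddress
import re

# Tokens that look like dotted-quad IPv4 addresses (four groups of digits).
_IPV4_CANDIDATE = re.compile(r"(?<![\w.])(\d{1,3}(?:\.\d{1,3}){3})(?![\w.])")


def find_ip_literals(config_text):
    """Return (line_number, address) pairs for each IPv4 literal found.

    Hypothetical helper: real configurations may also need IPv6 checks
    and an allow-list for values such as four-part version numbers.
    """
    findings = []
    for line_number, line in enumerate(config_text.splitlines(), start=1):
        for candidate in _IPV4_CANDIDATE.findall(line):
            try:
                ipaddress.IPv4Address(candidate)
            except ValueError:
                continue  # looks like an address but is not valid, e.g. 999.1.1.1
            findings.append((line_number, candidate))
    return findings


# Illustrative configuration; the host names are made up.
sample_config = """\
db_host = orders-db.internal.example.com
cache_host = 10.20.30.40
queue_url = amqp://bus.internal.example.com:5672
"""

violations = find_ip_literals(sample_config)
print(violations)
```

Run against the sample, the check flags only the `cache_host` line, which is the one entry that would have to be edited by hand during a switch between vendors.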

A CENTRAL SWITCH

DATA MUST BE REPLICATED

As with traditional data centre replication, data must be replicated between Cloud vendors A and B, whether in a database, file, or message bus.
Where latency is said to be too high, ensure that both production and DR still receive the same data.

The challenge here is the latency between the clouds, and undoubtedly, it will be discussed much more. However, overcoming it is one more hurdle to clear in order to avoid the possible large-to-disastrous consequences of a Cloud vendor’s outage.
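The requirement that production and DR receive the same data can be verified rather than assumed. The sketch below compares an order-independent digest of the records on each side. It is an illustration under assumptions: the function names `dataset_digest` and `in_sync` and the sample records are hypothetical, and a real check would run against database exports or message logs, usually in batches.

```python
import hashlib
import json


def dataset_digest(records):
    """Return a SHA-256 digest that does not depend on record order."""
    row_hashes = sorted(
        hashlib.sha256(
            json.dumps(record, sort_keys=True).encode("utf-8")
        ).hexdigest()
        for record in records
    )
    return hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()


def in_sync(production_records, dr_records):
    """True when both sides hold exactly the same set of records."""
    return dataset_digest(production_records) == dataset_digest(dr_records)


# Illustrative data only.
production = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
dr_current = [{"id": 2, "amount": 250}, {"id": 1, "amount": 100}]
dr_stale = [{"id": 1, "amount": 100}]

print(in_sync(production, dr_current))
print(in_sync(production, dr_stale))
```

Sorting the per-record hashes means replication that delivers records in a different order still counts as in sync, while a missing or altered record does not.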

SAAS MUST MEET THE SAME PRINCIPLES

As capabilities increasingly become SaaS (Software as a Service), they must be identical and available at both Cloud vendors. Principle: SaaS upgrades must be synchronised across both vendors.

SCHEDULING AND ORCHESTRATION

LEAVE POLITICS OUTSIDE

From the beginning, be open about politics being the organisation’s weakest link to becoming resilient.

It would be relatively simple if we kept politics outside.

Ensure you have a solid business case so that the work survives a change in leadership.

 

Final Advice

To succeed with this important work, I advise engaging architects who focus on principles and bringing in subject matter experts (SMEs) in each area. Start by developing a prototype that addresses the most challenging areas.

You will encounter many challenges and opportunities, but facing challenges is a part of life and makes our days interesting.

Are you up for the challenge? Let’s keep the discussion flowing: post your thoughts.
