My advice on how to become resilient in the Cloud

Lars Wikstrom

1 year ago

Mastering Resilience in the Cloud

This continues from chapter one, the Digital Tsunami, which I posted just before the Azure outage (https://transanalytics.co.uk/the-digital-tsunami-a-fictional-wake-up-call-for-cloud-resiliency/).
This post takes you through my model for mastering Cloud resiliency.

First, I readily admit that, depending on your organisation’s Cloud footprint and how you utilise the Cloud, both the challenge and the investment can be significant. However, I suggest you position the current Cloud architecture pattern, with a single Cloud (my assumption), as IT debt.

If the organisation does not see the importance of being resilient in IT, I suggest you use questions to get people on board. On the topic of questions, I recommend you read this article in HBR, which provides excellent tips: https://hbr.org/2024/05/the-art-of-asking-smarter-questions

For example: “What was the cost of the recent Cloud outage?”, “What would happen if the outage lasted more than 2, 3, or 4 weeks?” and “Can we assume that DORA will also cover Cloud resiliency?”

My advice

First, we must consider this a general challenge with the Cloud. Before the Cloud, few, if any, would consider placing large business systems in a single data centre without the possibility of switching to a disaster recovery site. Today, it has become customary to assume that incidents like the one that just happened never happen. It reminds me of the Three Mile Island nuclear power station: before its partial meltdown, that too was something that could never happen (https://en.wikipedia.org/wiki/Three_Mile_Island_accident).

DORA has been invited in

After the recent Azure outage, Aunt DORA, the regulation (excuse the pun) that was once on the sidelines, has now been brought into the conversation. It wouldn’t be surprising if the scope of the regulation were soon expanded to cover resiliency in the Cloud more precisely. As the financial services regulators (FSA) are working to be proactive, you should start studying and implementing what is required to comply with the regulation. The formal text of DORA is at https://eur-lex.europa.eu/eli/reg/2022/2554/oj and a more easily digestible summary is at https://kpmg.com/xx/en/home/insights/2023/10/digital-operational-resilience-act.html

Reduce your risk exposure to a single Cloud provider

If you are using the Cloud and your business can handle an outage lasting several days or weeks, this post is less relevant to you. However, if you, like me, believe that most businesses today can be severely damaged by an extended outage, perhaps you should continue reading. If you use a single Cloud provider, you are highly exposed.

You may want to start taking action now to reduce the risk of an extensive outage.
To succeed in this work, it is essential to base it on principles. Principles help people better understand the reasoning behind the strategy and significantly increase organisational alignment.

Core principles for a significantly increased level of resiliency in Cloud

CLOUD VENDOR-AGNOSTIC TECHNOLOGIES MUST BE USED

The end-to-end architecture must be Cloud vendor agnostic. In practice, it cannot use technologies unique to one Cloud vendor. The technologies you use must be available from both Cloud vendors (Cloud A and Cloud B in the rest of this post).

NO CLOUD VENDOR-SPECIFIC FUNCTIONALITY OR DRIVERS CAN BE USED

Even though a specific technology may appear to be Cloud vendor agnostic, the vendor has often added unique functionality, which you must be aware of and refrain from using.
You are either agnostic or you are not. You do not want to discover that what looked like a minor exception at the time has since been used across the landscape and now costs you billions in losses.
No vendor-specific functions, drivers or adapters can be used.
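
One way to keep this principle honest is to check the code base automatically for vendor-specific dependencies. The sketch below is only an illustration under stated assumptions: it assumes the applications are written in Python, and the list of package prefixes (the AWS, Azure and Google Cloud SDKs) is my own, non-exhaustive choice. The function name find_vendor_imports is hypothetical.

```python
import ast

# Assumption: a non-exhaustive list of top-level packages that tie code
# to a single Cloud vendor. Extend it to match your own landscape.
VENDOR_SPECIFIC_PREFIXES = ("boto3", "botocore", "azure", "google.cloud")


def find_vendor_imports(source: str) -> list[str]:
    """Return the vendor-specific modules imported by the given Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            # Match the package itself or any of its submodules.
            if any(name == p or name.startswith(p + ".")
                   for p in VENDOR_SPECIFIC_PREFIXES):
                found.add(name)
    return sorted(found)


sample = (
    "import boto3\n"
    "import json\n"
    "from azure.storage.blob import BlobServiceClient\n"
)
print(find_vendor_imports(sample))  # ['azure.storage.blob', 'boto3']
```

A check like this could run in the build pipeline so that an exception is caught when it is introduced, not after it has spread.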

PROVEN END-TO-END TO RUN ON BOTH CLOUD VENDORS

It is not enough for a solution to be said to run on both Cloud vendors; it has to be proven.
To prove the functionality, you must execute switches between the Clouds.
This raises challenges, such as whether to switch the whole Cloud architecture or only parts of it (segments). I will follow up on that in part 3.

SERVERS’ LOCATION MUST BE DNS ALIASES

At the time of a change, you do not want to start changing hundreds or thousands of network addresses manually; hence, no configuration, solution, or application can refer to an IP address.

DNS aliases must be used.
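
To find out how far you are from this principle, you can scan configuration files for hard-coded addresses. This is a minimal sketch, assuming plain-text configuration and IPv4 only; the function name find_hardcoded_ips and the sample host names are hypothetical. Expect some false positives, for example four-part version numbers.

```python
import ipaddress
import re

# Candidate tokens that look like dotted-quad IPv4 addresses.
CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def find_hardcoded_ips(text: str) -> list[tuple[int, str]]:
    """Return (line number, address) for every valid IPv4 literal in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for token in CANDIDATE.findall(line):
            try:
                ipaddress.ip_address(token)  # rejects e.g. 999.1.1.1
            except ValueError:
                continue
            hits.append((line_no, token))
    return hits


config = (
    "db_host = db.prod.example.internal\n"
    "cache_host = 10.20.30.40\n"
)
print(find_hardcoded_ips(config))  # [(2, '10.20.30.40')]
```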

A CENTRAL SWITCH

I will not go into the mechanics of switching between Cloud A and B here.
If you are in financial services, you should already have resources with recent experience and expertise on this topic; otherwise, this is a great time to refresh those skills.
Switching between Clouds must be done centrally, with a run book, and tested at a regular frequency.
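
To make the idea of a central, run-book-driven switch more concrete, here is a minimal sketch. It assumes the run book can be expressed as an ordered list of steps that each report success or failure; the step names and the classes Step and SwitchRunBook are hypothetical, and the actions are stubs where real checks and changes would go.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True on success


@dataclass
class SwitchRunBook:
    target: str                 # the Cloud we are switching to
    steps: list[Step]
    log: list[str] = field(default_factory=list)

    def execute(self) -> bool:
        """Run the steps in order and stop at the first failure."""
        for step in self.steps:
            ok = step.action()
            self.log.append(f"{step.name}: {'ok' if ok else 'FAILED'}")
            if not ok:
                return False
        return True


# Hypothetical steps; each lambda stands in for a real, tested procedure.
run_book = SwitchRunBook(
    target="Cloud B",
    steps=[
        Step("freeze writes on Cloud A", lambda: True),
        Step("confirm replication has caught up", lambda: True),
        Step("repoint DNS aliases to Cloud B", lambda: True),
        Step("run smoke tests on Cloud B", lambda: True),
    ],
)
print(run_book.execute())
print(run_book.log)
```

Keeping the steps and the log in one place is what makes the switch central: everyone runs the same sequence, and every rehearsal leaves a record.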

DATA MUST BE REPLICATED

As with traditional data centre replication, data must be replicated between Cloud vendors A and B, whether in a database, file, or message bus.
Where latency is said to be too high for replication, ensure that production and DR are both fed the same data.

The replication must include all data required to restore operation and maintain compliance levels, such as business data, audit files, and the state of an application that will be restarted.

The challenge here is the latency between the Clouds, and it will undoubtedly be discussed much more. However, it is just another hurdle to overcome in order to avoid the potentially large-to-disastrous consequences of a Cloud vendor’s outage.
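
Replication is only useful if you can show that both sides hold the same data. The sketch below compares content digests of the records in Cloud A and Cloud B. It assumes, for illustration only, that each side can be read as a mapping from a record key to its bytes; the function names digest and replication_drift are hypothetical.

```python
import hashlib


def digest(records: dict[str, bytes]) -> dict[str, str]:
    """Map each record key to the SHA-256 digest of its content."""
    return {key: hashlib.sha256(value).hexdigest()
            for key, value in records.items()}


def replication_drift(primary: dict[str, bytes],
                      replica: dict[str, bytes]) -> dict[str, list[str]]:
    """Report records that are missing, extra, or different in the replica."""
    p, r = digest(primary), digest(replica)
    return {
        "missing_in_replica": sorted(p.keys() - r.keys()),
        "extra_in_replica": sorted(r.keys() - p.keys()),
        "content_mismatch": sorted(k for k in p.keys() & r.keys()
                                   if p[k] != r[k]),
    }


cloud_a = {"order-1": b"paid", "order-2": b"shipped", "audit-1": b"login"}
cloud_b = {"order-1": b"paid", "order-2": b"pending"}
print(replication_drift(cloud_a, cloud_b))
```

With the sample data, the report lists audit-1 as missing in the replica and order-2 as a content mismatch.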

SAAS MUST MEET THE SAME PRINCIPLES

As capabilities increasingly become SaaS (Software as a Service), they must be identical and available at both Cloud vendors. Principle: upgrades in the SaaS for both vendors must be synced.

The SaaS vendor must have redundancy of its own, be Cloud vendor agnostic, and equally practise switching between Cloud vendors.

SCHEDULING AND ORCHESTRATION

The state of solutions and systems includes scheduling and orchestration. After an outage, depending on your chosen strategy, you will either restart failed processes or start processing from the recovery point.
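
The two strategies can be sketched as a simple selection over the replicated scheduler state. This is an illustration under my own assumptions: the scheduler records a start time and a completion flag per job run, and "processing from the recovery point" means rerunning every job that started at or after that point, whether or not it completed. The names Strategy, JobRun and jobs_to_run are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Strategy(Enum):
    RESTART_FAILED = "restart_failed"
    FROM_RECOVERY_POINT = "from_recovery_point"


@dataclass(frozen=True)
class JobRun:
    name: str
    started_at: datetime
    completed: bool


def jobs_to_run(runs: list[JobRun], strategy: Strategy,
                recovery_point: datetime) -> list[str]:
    """Select the jobs to run after an outage under the chosen strategy."""
    if strategy is Strategy.RESTART_FAILED:
        # Only jobs that did not complete are restarted.
        return [r.name for r in runs if not r.completed]
    # Assumption: anything started at or after the recovery point is rerun.
    return [r.name for r in runs if r.started_at >= recovery_point]


runs = [
    JobRun("load-trades", datetime(2024, 8, 1, 1, 0), completed=True),
    JobRun("settle", datetime(2024, 8, 1, 2, 0), completed=True),
    JobRun("report", datetime(2024, 8, 1, 3, 0), completed=False),
]
recovery_point = datetime(2024, 8, 1, 2, 0)
print(jobs_to_run(runs, Strategy.RESTART_FAILED, recovery_point))
print(jobs_to_run(runs, Strategy.FROM_RECOVERY_POINT, recovery_point))
```

Either way, the selection only works if the scheduler state itself is part of what you replicate.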

LEAVE POLITICS OUTSIDE

From the beginning, be open about politics being the organisation’s weakest link in becoming resilient.

Becoming resilient would be relatively simple if we kept politics out of it.

Ensure you have a solid business case that still stands if there is a change in leadership.

Final Advice

To succeed with this important work, I advise engaging architects who focus on principles and bringing in subject matter experts (SMEs) in each area. Start by developing a prototype that addresses the most challenging areas.

You will encounter many challenges and opportunities, but facing challenges is a part of life and makes our days interesting.

Are you up for the challenge? Let’s keep the discussion flowing: post your thoughts.