Understanding AWS Availability Zones: Boosting SaaS Resilience and Uptime

Resilience and availability are critical aspects of every SaaS application. By building your SaaS Serverless services on AWS' infrastructure, you automatically deploy to multiple availability zones within a single region, which provides increased resilience and automated availability failover mechanisms. As engineers and architects, it is imperative to understand these concepts and how to configure our services correctly.

In this post, you will learn about availability zones (AZs), what they are, why they are essential, and my tips for configuring resources across multiple AZs within a single region.

Availability Zones Definitions and Properties
1. Availability Zone's Properties
Why Understanding AZs Matters
1. AZs and the Peculiar IDs Case
2. When Can an AZ or Region Go Down?
Recommendations for Availability Zone Selection
Understanding Availability Zone's Cost
Summary

Availability Zones Definitions and Properties

When we create resources in a region, we create them in one or more availability zones.

The AZs (availability zones) work together behind the scenes to keep our service running by sharing the traffic, replicating the data in the databases, and taking over each other's traffic in case of an AZ outage.

The specifics are sometimes hidden from us, but sometimes we explicitly define the number of AZs we want to use. VPCs, Aurora, and Fargate come to mind.

Let's define what an availability zone is. According to AWS documentation:

AWS resources are hosted in multiple locations world-wide. These locations are composed of AWS Regions, Availability Zones, and Local Zones. Each AWS Region is a separate geographic area. Each AWS Region has multiple, isolated locations known as Availability Zones.

As you know, "everything fails all the time," as Werner Vogels always says, so the concept of AZs is critical to making sure our service keeps working when one AZ has issues while the other AZs are online.

AZs provide higher availability, fault tolerance, and scalability than a single data center. Instead of having a single point of failure, you now have multiple zones to rely on in case one of the AZs has issues.

*There's also the concept of multi-regional services, which raises service availability a notch, but I will not cover it here.

Availability Zone's Properties

AZs have special characteristics according to AWS documentation:

Each region has a different number of AZs, but they all have at least 3.
All AZs in a region are interconnected via high-bandwidth, low-latency, fully redundant metro fiber.
Traffic between AZs is encrypted and supports synchronous replication.
Partitioning applications across AZs improves protection against issues like power outages and natural disasters.
AZs are physically separated by a meaningful distance, typically within 100 km (60 miles) of each other.
Your service might not utilize ALL the AZs.

When one AZ has issues in a region, AWS knows how to utilize the other AZs to keep your service and its resources going. Traffic will "automagically" shift to other healthy AZs and their resources. In addition, once the faulty AZ has returned online, some databases know how to replicate the data they missed.

All of this is managed for us by AWS. That's pretty amazing if you ask me.

Why Understanding AZs Matters

Serverless developers don't usually think about AZs, as this information is handled by AWS behind the scenes. When using Serverless services like Lambda and DynamoDB, the AZs are selected for you, and you don't need to manage or maintain them. For more info, check the AWS docs.

However, when using VPCs or not fully Serverless services (you need to define their VPCs and they don't scale to zero) like Aurora, OpenSearch, or Fargate, AZs come into play, and it is important to understand their impact.

AZs can increase overall SLA and SLI as you are more resilient to a single AZ failure. They increase fault resilience and availability and have the potential to improve performance as AWS automatically balances traffic and access across AZs.

Let's take Aurora, for example, and review some of the advantages that deploying to multiple AZs brings:

Aurora stores copies of the data in a DB cluster across multiple Availability Zones in a single AWS Region. When data is written to the primary DB instance, Aurora synchronously replicates the data across Availability Zones to six storage nodes associated with your cluster volume - AWS docs

Other than protecting your databases against failures when one AZ has issues, it also allows AWS to conduct failovers in case of planned maintenance.

AZs and the Peculiar IDs Case

Availability zones have physical IDs and names:

AWS maps the physical Availability Zones randomly to the Availability Zone names for each AWS account. This approach helps to distribute resources across the Availability Zones in an AWS Region, instead of resources likely being concentrated in Availability Zone "a" for each Region. As a result, the Availability Zone us-east-1a for your AWS account might not represent the same physical location as us-east-1a for a different AWS account.

When we consider a larger scale than a single AWS account, such as an AWS organization with multiple services deployed across multiple accounts, the issue becomes even more significant.

If two services in two different accounts in that organization were to deploy to specific AZs by their names: 'us-east-1a' and 'us-east-1b', AWS would map to different physical AZs (over which you have no control) across your organization's accounts. While you think you are deploying your resources to the same physical AZs, that is likely untrue!

In order to control what specific AZ you deploy to, you need to know the correct mapping between name and physical ID.
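
To make the mismatch concrete, here is a small sketch with two hypothetical accounts. The name-to-ID pairs are invented for illustration; in a real account they come from the EC2 DescribeAvailabilityZones API.

```python
# Hypothetical AZ name -> physical AZ ID mappings for two accounts in one organization.
# These pairs are invented for illustration only.
account_a = {"us-east-1a": "use1-az4", "us-east-1b": "use1-az1", "us-east-1c": "use1-az2"}
account_b = {"us-east-1a": "use1-az1", "us-east-1b": "use1-az2", "us-east-1c": "use1-az4"}


def physical_zones(mapping, names):
    """Translate AZ names into the physical AZ IDs they resolve to in one account."""
    return {mapping[name] for name in names}


# Both teams "deploy to us-east-1a and us-east-1b" by name...
names = ["us-east-1a", "us-east-1b"]
zones_a = physical_zones(account_a, names)
zones_b = physical_zones(account_b, names)

# ...but the two deployments overlap in only one physical AZ.
shared = zones_a & zones_b
print(sorted(zones_a), sorted(zones_b), sorted(shared))
```

With these made-up mappings, the two services share a single physical AZ even though their configuration looks identical.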

We will cover a workaround with an AWS CDK code example later in this post.

When Can an AZ or Region Go Down?

An AZ can go down due to hardware failures, natural disasters, power outages, or other disasters.

A region goes down when ALL of its AZs suffer failures and are marked as "down." Such an outage affects all AWS customers deployed to this region. See the AWS case history, which goes back 13 years. These things happen!

Let's review some service failure use cases, from the simple to the edge cases.

The simple use case is that your service goes down when all the AZs it utilizes are down.

However, things can get more complicated. Let's assume you don't deploy to all the AZs in that region. The region has 3 AZs, and you use only 2 out of 3 AZs: AZs A and B. If availability zones A and B are down, you will still experience a regional failure on your application, even though the region still has one AZ functioning.

But it can get even more complicated. Let's assume that your SaaS service depends on another SaaS service at runtime. The two services are deployed in different AWS accounts.

Assume your service deploys to AZs 1 and 2, and service B, which you critically depend on, deploys to AZs 1 and 3. If AZs 1 and 3 go down, you still have AZ 2 available, but you are essentially down because your critical dependency, service B, is down.
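
The failure cases above can be modeled in a few lines. This is a toy model with hypothetical function names, assuming a service stays up as long as at least one of its AZs is healthy:

```python
def is_up(service_azs, down_azs):
    """A service is up while at least one of its AZs is still healthy."""
    return bool(set(service_azs) - set(down_azs))


def is_effectively_up(service_azs, dependency_azs, down_azs):
    """A service with a critical runtime dependency needs both sides to be up."""
    return is_up(service_azs, down_azs) and is_up(dependency_azs, down_azs)


# Case 1: we use only AZs A and B of a three-AZ region, and both fail.
partial_deploy_up = is_up({"A", "B"}, {"A", "B"})

# Case 2: we run in AZs 1 and 2, service B runs in AZs 1 and 3, and AZs 1 and 3 fail.
my_service, service_b, down = {1, 2}, {1, 3}, {1, 3}
own_infra_up = is_up(my_service, down)
effectively_up = is_effectively_up(my_service, service_b, down)

print(partial_deploy_up, own_infra_up, effectively_up)
```

In the second case our own infrastructure still has a healthy AZ, yet the service is effectively down because service B has none.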

That is why it is important to understand AZs and to make sure the entire SaaS organization uses the same methodology and the same AZs.

Like everything else in software, it's all about pros, cons, and restrictions. Deploying to all AZs is the simplest answer, but it will cost you a lot.

Let's discuss my recommendations for AZ selection.

Recommendations for Availability Zone Selection

As a rule of thumb:

Two are better than one. Deploy to a minimum of two AZs. For consistency, make sure the whole organization uses the same AZs (1 and 2, for instance). See the 'AZ Selection via IaC' section below to see how to do it via IaC.
Critical SaaS services that are willing to pay extra for improved SLI and performance during partial AZ malfunction are encouraged to deploy to three or more AZs.
Before adding an AZ, it's crucial to calculate the cost of both the DATA TRANSFER and the extra infrastructure. This step ensures that the deployment remains cost-effective and aligns with the organization's budget.
For critical SaaS services that deploy to regions with more than 3 AZs, evaluate the cost to determine whether the added AZs are worth it. As a reference, such regions have failed before, so even a region with 6 AZs can go down.

AZ Selection via IaC

It is impossible to set AZ physical IDs like 'use1-az1' (for the us-east-1 region) directly in the AWS CloudFormation/CDK code. Instead, you need to provide the AZ names, and as we know, they are mapped to different IDs in each account. We need to determine which AZ name is mapped to AZ1, AZ2, etc.

You can do that in the CDK code by using the AWS SDK (boto3 for Python) to find the account-specific mapping between AZ name and ID.

Consider this Python code example that creates a VPC that is ALWAYS deployed to AZ1 and AZ2:
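
What follows is a minimal sketch of the approach rather than a definitive implementation. It assumes boto3 is installed and AWS credentials are available when the CDK app is synthesized. The helper names `fetch_zones` and `zone_names_for_ids` are hypothetical, the sample mapping is invented, and the CDK usage appears only as a comment that assumes CDK v2's `ec2.Vpc` with its `availability_zones` parameter.

```python
from typing import Dict, List


def fetch_zones(region: str) -> List[Dict]:
    """Ask EC2 which AZs this account sees in the region (needs AWS credentials)."""
    import boto3  # imported lazily so the pure helper below works without the SDK

    client = boto3.client("ec2", region_name=region)
    response = client.describe_availability_zones(
        Filters=[{"Name": "state", "Values": ["available"]}]
    )
    return response["AvailabilityZones"]


def zone_names_for_ids(zones: List[Dict], wanted_ids: List[str]) -> List[str]:
    """Map physical AZ IDs to this account's AZ names, preserving the requested order."""
    id_to_name = {zone["ZoneId"]: zone["ZoneName"] for zone in zones}
    missing = [zone_id for zone_id in wanted_ids if zone_id not in id_to_name]
    if missing:
        raise ValueError(f"AZ IDs not visible in this account: {missing}")
    return [id_to_name[zone_id] for zone_id in wanted_ids]


# Offline demonstration with an invented mapping (only ZoneName/ZoneId are used).
sample_zones = [
    {"ZoneName": "us-east-1a", "ZoneId": "use1-az4"},
    {"ZoneName": "us-east-1b", "ZoneId": "use1-az1"},
    {"ZoneName": "us-east-1c", "ZoneId": "use1-az2"},
]
az_names = zone_names_for_ids(sample_zones, ["use1-az1", "use1-az2"])
print(az_names)

# Inside a CDK v2 stack, the resolved names would then be handed to the VPC, e.g.:
#   az_names = zone_names_for_ids(fetch_zones("us-east-1"), ["use1-az1", "use1-az2"])
#   ec2.Vpc(self, "Vpc", availability_zones=az_names)
```

Raising an error when a requested ID is missing is a design choice in this sketch: failing at synth time is easier to notice than silently deploying to a different AZ.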

The magic occurs where we iterate over the mapping and look for the AZ names that match the IDs we want to deploy to. Simple and effective!

AZ Selection Tips for VPC

When configuring a Virtual Private Cloud (VPC), you have more control and must explicitly select which AZs to use. Typically, it is best practice to deploy your resources (such as EC2 instances or Elastic Load Balancers) across multiple AZs in the region.

Use multiple subnets within the VPC, each residing in a different AZ. This allows you to spread your resources across AZs and maintain redundancy in case of an AZ failure.
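
As a rough illustration of the one-subnet-per-AZ layout, the sketch below carves a VPC range into per-AZ blocks using only the Python standard library. The function name `plan_subnets`, the CIDR, and the AZ names are assumptions for the example; tools such as the CDK `Vpc` construct typically do this split for you.

```python
import ipaddress


def plan_subnets(vpc_cidr, az_names, new_prefix=24):
    """Assign one subnet CIDR per AZ, carved sequentially from the VPC range."""
    network = ipaddress.ip_network(vpc_cidr)
    subnets = network.subnets(new_prefix=new_prefix)  # iterator over equal-sized blocks
    return {az: str(next(subnets)) for az in az_names}


# Example values only: a /16 VPC spread over two AZs with /24 subnets.
plan = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b"])
print(plan)
```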

On a side note, if you have Lambdas inside your VPC, they will be deployed according to the VPC definition. Otherwise, it's up to the Lambda service to configure and choose the AZs.

AZ Selection Tips for Aurora

An Aurora DB cluster is fault-tolerant by design. The cluster volume spans multiple Availability Zones (AZs) in a single AWS Region, and each AZ contains a copy of the cluster volume data. This functionality means that your DB cluster can tolerate a failure of an AZ without any data loss and only a brief interruption of service.

We recommend that you distribute the primary instance and reader instances in your DB cluster over multiple Availability Zones to improve the availability of your DB cluster. - AWS docs

Understanding Availability Zone's Cost

Deploying and creating resources in multiple AZs increases cost. You pay for the deployed resources (Aurora replicas, EC2 instances, ALBs, NAT gateways, etc.) and, in some cases, for the traffic between the AZs.

Let's review the following scenario: We deploy an EC2 machine and an Aurora RDS MySQL cluster over a VPC with 2 AZs.

AZ traffic https://aws.amazon.com/blogs/architecture/overview-of-data-transfer-costs-for-common-architectures/

In this case, we will pay double the amount for the EC2, Aurora DB, VPC, and other network resources (ENIs, etc.).

As for data transfer, you don't pay for the cross-AZ data replication between the RDS instances, nor for data sent between EC2 and RDS within the same AZ.

However, you will pay for cross-AZ traffic between EC2 instances and for any data transferred between EC2 and an RDS instance in a different AZ (for example, when RDS is down in the first AZ).

Because of the added network and infrastructure, going from one AZ to two will more than double what you pay.

However, as you add more AZs, the infrastructure deployment's per-unit incremental cost will be lower.

The cost of data transfer itself differs between regions, but for the most part, cross-AZ data transfer within the region costs $0.01/GB. If the updates are back and forth, you pay twice, $0.02/GB. Read more here.
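
The arithmetic is simple enough to sketch. The function name and the traffic volumes below are hypothetical, and the rate is the $0.01/GB figure quoted above, so check the pricing for your own region before relying on it.

```python
CROSS_AZ_RATE_PER_GB = 0.01  # USD per GB, per direction; the figure quoted in this post


def cross_az_cost(gb_sent, gb_returned=0.0, rate=CROSS_AZ_RATE_PER_GB):
    """Estimate cross-AZ transfer cost, billing each direction separately."""
    return (gb_sent + gb_returned) * rate


# Example volumes only: 500 GB pushed one way, then 500 GB in each direction.
one_way = cross_az_cost(500)
round_trip = cross_az_cost(500, 500)
print(f"one way: ${one_way:.2f}, back and forth: ${round_trip:.2f}")
```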

Remember, the cost of adding an AZ can differ significantly between regions, so it's essential to consider these regional costs before making any changes.

Summary

In this post, we covered the importance of resilience and availability in SaaS applications built on AWS using Availability Zones (AZs). We defined what AZs are and why they matter and wrote concrete IaC code to configure resources across multiple AZs.

Thank you, Maxim Drobachevsky and Meitar Karas, for your help and review!