When and Why to Build High Availability Into Your Azure Cloud Platform
High Availability (HA) is the design of a system or application such that if the server running that system fails or restarts, the application is back up and running as fast as possible. Often the application needs to be back up and running faster than the server is able to restart, and HA facilitates that.
Microsoft Azure has a huge amount of HA built into the cloud platform. When it comes to the Platform as a Service (PaaS) offerings, all of the needed HA is simply built into the platform, and it is automatic. When working with Infrastructure as a Service (IaaS) offerings, though, the HA needed by services installed within the Virtual Machines (VMs) is not automatic. As such, you as the administrator of the cloud platform need to configure HA for these services.
Without HA, a service that goes offline because of a VM failure or a host failure stays down until the VM is running again. If the host server that is running the VM fails, the Microsoft Azure cloud will start the VM on another host. However, that restart can take several minutes to complete. For many mission-critical services that companies are running within the Microsoft Azure cloud environment, several minutes isn’t fast enough. Often, these services need to be able to recover and restart within seconds in order to prevent degradation of the services. When the services are offline, the business functions that those services drive are unavailable. This results in lost revenue for the business. By designing HA into the environment, the services can be back up and running as fast as possible, allowing the company to minimize revenue loss.
Beyond the worry of a host failure, the services running on VMs also need to be safeguarded so that they keep running without interruption even if the VM’s operating system restarts for any reason. We accomplish that by implementing and appropriately configuring HA when it is not built in automatically. For such manual implementations, the type of application installed on the VM will determine the kind of HA you should configure. These fall into two categories, which I will discuss below: HA for stateless applications and HA for stateful applications.
For stateless applications such as web services running on either Windows or on Linux, you will need to deploy a scale-out solution. In that scenario, you run multiple VMs all with the same services on them and a load balancer between the application user and the web servers. In Microsoft Azure these load balancers can be the Azure Load Balancer, Azure Application Gateway, or Azure Front Door.
With the combined use of a load balancer and multiple servers, we can protect the service from outages by stopping user traffic from routing to the failed server while still sending user traffic to the other servers in the scale-out solution.
These stateless applications can be web servers, caching servers, or application servers—basically, any server that processes user data and doesn’t store the final data on the server.
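To make the routing behavior described above concrete, here is a minimal sketch of a round-robin load balancer that skips servers marked as unhealthy. It is a toy model only, not how Azure Load Balancer, Application Gateway, or Front Door are implemented; the class name, method names, and server names (web-1, web-2, web-3) are all hypothetical.

```python
class RoundRobinBalancer:
    """Toy load balancer that rotates across backends and skips unhealthy ones."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)  # every backend starts out healthy
        self._next = 0                     # rotation pointer

    def mark_down(self, backend):
        # Called when a (hypothetical) health probe fails.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        # Called when a previously failed backend passes its probe again.
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once, starting from the rotation pointer.
        for _ in range(len(self.backends)):
            candidate = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


lb = RoundRobinBalancer(["web-1", "web-2", "web-3"])
lb.mark_down("web-2")                   # simulate a failed server
routed = [lb.pick() for _ in range(4)]  # traffic only reaches web-1 and web-3
print(routed)
```

In this sketch the failed server simply drops out of the rotation, and the remaining servers absorb its share of the traffic until it is marked healthy again.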
Stateful applications require a different HA solution than stateless applications. While stateless servers can be scaled out behind a load balancer, stateful servers such as database servers cannot be scaled out in the same way. This is because every server that might take over must hold the same data as the server that is accepting the writes, so that nothing is lost if that server fails. In this context, you should use the HA solution that is built into the application where one exists. For example, for Microsoft SQL Server, you should use Always On availability groups. For stateful applications that don’t have an HA solution of their own, if they’re running on Microsoft Windows, you can use Windows Server Failover Clustering. With this feature, you can configure the application as a clustered service. This allows it to fail over: the service switches to another server in the cluster within a few seconds if the operating system fails or restarts.
For stateful applications on Linux servers, there are a variety of clustering solutions. Pacemaker, usually paired with Corosync, is a common choice, and it builds on supporting libraries such as libqb and Kronosnet. These packages allow you to configure a Linux application within a failover cluster so that the service will automatically switch between the members of the cluster within a few seconds of a failure.
No matter the operating system or the application, the concept is the same: Install the application on each server within the cluster, then configure the clustering software for the application service. This configures the application to run on only one server at a time, while the other servers in the cluster are standing by to run the service.
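The active/standby idea can be sketched in a few lines of Python. This is an illustrative model under simplifying assumptions (no quorum, no fencing, no shared storage), not the behavior of Windows Server Failover Clustering or Pacemaker; the class, its methods, and the node names are hypothetical.

```python
class FailoverCluster:
    """Toy active/standby cluster: the service runs on exactly one node at a time."""

    def __init__(self, nodes):
        self.nodes = list(nodes)        # list order doubles as failover preference
        self.online = set(self.nodes)
        self.active = self.nodes[0]     # the service starts on the first node

    def _elect(self):
        # Pick the first online node in preference order, or None if all are down.
        for node in self.nodes:
            if node in self.online:
                return node
        return None

    def node_failed(self, node):
        self.online.discard(node)
        if node == self.active:
            # The node running the service is gone, so move it to a standby.
            self.active = self._elect()
        return self.active

    def node_recovered(self, node):
        if node in self.nodes:
            self.online.add(node)
        if self.active is None:
            # The service was fully down; restart it on the recovered node.
            self.active = self._elect()
        return self.active


cluster = FailoverCluster(["node-a", "node-b"])
after_failure = cluster.node_failed("node-a")      # service moves to node-b
after_recovery = cluster.node_recovered("node-a")  # node-a rejoins as a standby
print(after_failure, after_recovery)
```

Note that in this sketch the service does not move back when the original node recovers. Whether to fail back automatically is a policy choice that real clustering software lets you configure.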
HA does increase the cost of running the services within the Microsoft Azure cloud, but the services will remain online through VM failures, VM operating system failures, and host server failures, and that benefit will typically outweigh the cost. You can think of these costs like an insurance policy for the application. In the event of a server-level failure, the application will remain online, allowing the company to continue servicing its customers and continue making money.
When configuring HA, choose carefully where you place the VMs and other resources. Microsoft Azure has two different availability options depending on the Microsoft Azure region you are in. This is because some regions support availability zones, which allow you to specify that resources should be placed in different buildings within the specific Microsoft Azure region.
When using the availability zones feature, you are typically presented with three zones within the selected region, named Zone 1, Zone 2, and Zone 3. While this doesn’t allow you to select the specific building within the Microsoft Azure region, it does place each resource in the same building as the other resources you have put in the same zone. Using a two-node cluster as an example, you can configure one node of the cluster to be in Zone 1 and the second node in Zone 2. This ensures that the nodes are in different buildings within the same Microsoft Azure region. This gives you additional availability, because the buildings that are mapped to zones are isolated from each other, with separate power and network infrastructure. With this configuration, a localized failure in one building should not take down both nodes of the cluster. Microsoft offers a 99.99% service level agreement (SLA) for VMs deployed across two or more availability zones.
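As a rough illustration of spreading cluster nodes across zones, the following sketch assigns nodes to zones in round-robin order. The function and node names are hypothetical, and the zone labels simply mirror the three zones described above; in a real deployment you would pass the chosen zone to whatever tool you use to create each VM.

```python
def assign_zones(node_names, zones=("1", "2", "3")):
    """Spread nodes across zones in round-robin order (illustrative only)."""
    return {name: zones[i % len(zones)] for i, name in enumerate(node_names)}


# A two-node cluster lands in Zone 1 and Zone 2, as in the example above.
placement = assign_zones(["sql-node-1", "sql-node-2"])
print(placement)
```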
For regions that don’t yet support availability zones, Microsoft has availability sets. While availability sets won’t guarantee isolation between buildings, they do guarantee that the servers are placed in separate racks within the datacenter. Unlike an availability zone, an availability set is a resource that the administrator of the Azure account must create, either before creating the VM or as part of the VM deployment process. Using the same two-node cluster example as above, both nodes of the cluster are placed within the availability set, which then guarantees that the nodes are placed in two different racks within the datacenter. The racks are isolated from each other, with independent network switches and power circuits. Microsoft offers a service level agreement (SLA) of 99.95% for VMs running in an availability set.
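To put the two service levels in perspective, the short calculation below converts each percentage into the downtime it would still permit over a 365-day year. The function name is hypothetical, and the figures are plain arithmetic on the percentages quoted above, not additional commitments from Microsoft.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(sla_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime that still satisfy the given availability percentage."""
    return period_minutes * (1 - sla_percent / 100)


zone_minutes = allowed_downtime_minutes(99.99)  # availability zones
set_minutes = allowed_downtime_minutes(99.95)   # availability sets

print(f"99.99%: about {zone_minutes:.1f} minutes per year")
print(f"99.95%: about {set_minutes:.1f} minutes per year ({set_minutes / 60:.1f} hours)")
```

The difference looks small as a percentage, but it works out to roughly 53 minutes versus about 4.4 hours of permitted downtime per year.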
While not all Microsoft Azure regions have availability zones, they are available in a large portion of regions, and they are being introduced to more on a regular basis. Microsoft needs to make a lot of infrastructure changes in order to add availability zones to a region. Availability sets, however, have been available in each Microsoft Azure region since that region was brought online.
While you cannot automatically move VMs from availability sets to availability zones, you can make the move manually with only a small outage to the VM, and it can be scheduled to minimize the disruption to the services and company.
High availability is critical to the proper functioning of many services on Microsoft Azure VMs. While there is a great deal of availability built into the Azure platform, you must manually configure it in IaaS contexts. This will look different depending on whether you are running stateful or stateless applications. You can optimize availability with proper planning and deployment of resources. Whatever kinds of services are running on the server, you should consider HA and design for it from the start.