This post is also available in: Español (Spanish)
In an on-premises environment when we propose solutions to geographical disasters, the most common option is log shipping. The use of asynchronous database mirroring or availability groups with asynchronous replicas is also common but includes an additional risk that is not usually contemplated. We refer to the “speed” with which the changes are transferred, as quickly as the network and the target system allow us. This means that when the disaster has a human origin, an important error when we become aware of it, we will have this error replicated and applied. Obviously, a better solution would be to combine both options, which are not exclusive, with which we would cover more disaster scenarios increasing the cost of the solution.
Another common alternative in on-premises virtual environments (or physical ones as well) is to use some tool/solution that allows us to have snapshots of our machines replicated geographically. In the case of a cloud environment in Azure, we can implement this functionality easily using Azure Site Recovery (ASR). In this post, we will show the operation of this technology as an alternative to the use of log shipping thinking in those cases where we do not want to have to maintain the N log shipping that we need for a SQL Server instance manually or where we want to contain the cost of the solution even further.
For this example, we are going to create a couple of resource groups, one in the north of Europe and the other in the west of Europe:
In the north European group, we have deployed a virtual machine with SQL Server 2016 with all the associated resources of a disk, network, etc.
In the west European group, we will create a Recovery Services vault where the backups and disaster recovery configurations that we perform will be stored:
To protect our machine the process is very simple, we just have to go to the properties of the machine and select the “Disaster recovery” option in the operations category:
In the next settings screen, in our case, we will only modify the target resource group, which we will be the one we have previously created for the vault:
If we look, we can see that we can configure some options on this screen that can be useful. For example, we can change the storage type, and use non-premium discs in the DR environment:
We can also configure the replication policy. By default, we have only one policy created, with 24h of retention, but we can create a customized one that best fits our needs.
When we launch the replication process, we will be able to see the process graphically on a map:
We bear in mind that this process can take a while depending on the size of the machine and the configuration to be performed. In our case, the configuration took about 10 minutes:
Once the configuration is done, we can see the progress of the replication on a global scale on the machine:
In addition, we will also have information at the individual disk level, where we would highlight the data of how many pending changes there are at the source that has not yet been replicated:
Finally, when the replication finishes the status will change “Protected” and the RPO will be indicated, which will change depending on the time elapsed since the last snapshot:
From the SQL Server perspective, by default, backups will be processed via VDI:
This has the same implications as when using any system based on this technology. The main drawback is that the backup that is made will invalidate the differential chain from the last backup made by us, so if we try to launch a differential backup we will find an error as if there was no valid full backup.
If we want to avoid this problem we must make a change in the Windows registry so that VSS is used by the Azure Backup Agent (https://docs.microsoft.com/en-us/azure/backup/backup-azure-vms-troubleshoot#troubleshoot-vm-snapshot-issues y https://docs.microsoft.com/en-us/azure/backup/backup-azure-vms-introduction#windows-vm ):
From the point of view of consistency, we really have two points to consider. One is the “crash-consistent” and the other is “app-consistent”:
The App-Consistent is the one that VSS would use before as we have said, having a consistent snapshot of the operation and all the services compatible with VSS installed on the server. However, with the crash-consistent, basically what you get is the same result that we would have got if a “hard reset” was applied to a server. App-Consistent snapshots can be programmed with a minimum frequency of 1 hour. We must bear in mind that if we restore a crash-consistent recovery point we can have some problems with applications that are not prepared for it. In the case of SQL Server, it should not be a problem because the SQL Server database recovery process itself will be able to handle this situation without problems.
We must also think that ASR is much more than a “silo” for backups and that DR also offers solutions for VMWare, Hyper-V virtual environments, etc. so it can be a good ally in general for our DR strategy at the corporate level:
Once the replication is finished, the decisive moment arrives, testing that we can really raise the virtual machine in Western Europe in case of a crash. For this, we are going to connect to the original machine and we are going to create a process that inserts entries in a table every minute. In this way, we can verify how “fresh” the data is once we restore the DR environment. The following script creates a database with a table and inserts a record every minute:
create database testASR go use testASR go CREATE table test (fecha datetime) go while (1=1) begin insert into test values (getdate()) waitfor delay '00:01:00' end go
We will obtain the last records entered, to have them as a reference:
The next step will be to perform a test failover that basically performs the same actions as the failover, will raise the new machines, etc. but without affecting the current replication that will continue to work independently. We can select the recovery point that we want, the last of which will be the one with the lowest RPO:
The steps that will be followed will be dependent on the protected environment, but in this case, they basically consist of creating a new virtual machine clone of the previous one and turning it on. In this test, the total time needed has been less than 4 minutes:
We will connect to the machine created by the test failover:
And we will launch the query to check the last records inserted:
We can see that in this case, we have only lost one entry so the real RPO will have been greater than 60 seconds, but less than 120 seconds. Once we have tested that we can lift and access the DR machine, we will complete the process by performing a test clean-up:
The clean-up procedure is also quite fast, in this case, less than 5 minutes:
In summary, we have seen how we can protect a SQL Server virtual machine in Azure from a disaster by replicating from the northern region of Europe to Western Europe. The process of testing the DR environment is straightforward and fast thanks to the automation offered by Azure. Protecting a single machine is only a small step within a DR strategy at the corporate level, which must take into account many more factors, as well as the dependencies between the different resources, services, etc.