Lately, we have noticed that traditional SANs are steadily losing ground. There are various reasons for this, such as forced migrations to the cloud, limited budgets, or customers growing tired of their storage providers’ draconian terms and conditions. Common examples include very costly maintenance agreements, disk capacity upgrades priced above market rates, and the difficulty of expanding or improving hardware without completely replacing it. There is also a strong trend towards further virtualisation and treating servers as commodities. All of this is driving a shift towards software-defined storage (SDS), while virtual SANs (VSANs) continue to attract new supporters.
In the past, we have tested the SDS features offered by Windows Server 2016, including Storage Replica. However, we have certain issues with Microsoft’s solution, which currently offers only a subset of the features commonly found in a SAN and has many limitations. Some of these limitations translate into less flexibility when implementing changes, which is why other solutions such as VMware vSAN have a wider market share. We should not forget the veteran contender that has been offering this type of virtual SAN for several years, and whose industry experience and robustness put it a step ahead of the rest. We are, of course, referring to StarWind and its Virtual SAN software, whose latest version includes a full-featured free edition. This means that, apart from the VTL functionality and the graphical interface, we can test the product without limitations. Although this edition can be used in production, support is only available to paying customers, so we will have to rely on ourselves or on the StarWind forums.
Azure Virtual Machines
In this post, we will look at how to use this software on virtual machines which, in our case, will be Azure virtual machines:
In addition to the C: drive, we will add a P30 premium storage disk mounted as the S: drive. Next, we will download and install the software, which is available directly from StarWind’s website by requesting a trial license or by selecting the free version. We will begin by installing it on the first node and checking that we can connect with the management console.
Our next step is to create a new virtual disk:
We shall select the above-mentioned S: drive as the location and configure a 100 GB virtual disk.
Next, we shall define the type of virtual disk. This choice is fairly important, because we can opt for either a traditional thick-provisioned disk or an LSFS device.
LSFS is a log-structured file system: it writes changes as a log in sequential disk segments, as opposed to the random writes that concurrent operations typically cause in a virtual environment. This can help in environments where the IOPS limits are more restrictive than the overall bandwidth. In a way, it is similar to what NetApp does, or to SQL Server’s own transaction log, in that it tries to reduce the number of operations and maximise sequentiality.
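To make the idea more concrete, here is a minimal Python sketch of a log-structured device. It is only a toy model written for this post, not StarWind's implementation: the class name `LogStructuredDevice` and its methods are hypothetical. It shows how writes to scattered logical blocks end up at consecutive physical positions, with an index pointing at the newest copy of each block.

```python
class LogStructuredDevice:
    """Toy model: every write is appended to the end of a log."""

    def __init__(self):
        self.log = []    # physical layout: (logical_block, data) in arrival order
        self.index = {}  # logical block -> position of its newest copy in the log

    def write(self, logical_block, data):
        position = len(self.log)             # always the next sequential slot
        self.log.append((logical_block, data))
        self.index[logical_block] = position  # the newest copy wins
        return position

    def read(self, logical_block):
        return self.log[self.index[logical_block]][1]


device = LogStructuredDevice()
# Scattered logical addresses, as concurrent workloads would produce them.
positions = [device.write(block, f"v{i}") for i, block in enumerate([907, 12, 455, 12, 3])]
print(positions)        # consecutive physical positions
print(device.read(12))  # newest version of block 12
```

Note that block 12 is written twice and both copies stay in the log. A real log-structured system has to reclaim the space used by stale copies, which this sketch leaves out.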
Next, we will configure the cache. In this type of VSAN solution, RAM can play the role of the controller cache for both reads and writes. At first, caching writes (write-back) may seem risky, but we must bear in mind that in a typical setup the cache is replicated across several nodes, so the failure of one node does not imply any data loss. We would only lose data if all the nodes in our solution were switched off simultaneously before they were able to flush the data to disk.
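The reasoning about write-back risk can be sketched with a small simulation. This is a simplified model under our own assumptions (the `Node` class and helper functions are hypothetical and say nothing about how StarWind replicates its cache): a write is held in RAM on every online node, and it is lost only if every node powers off before flushing.

```python
class Node:
    """Toy storage node with a volatile RAM cache and a persistent disk."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.ram_cache = {}  # lost when the node is powered off
        self.disk = {}       # survives a power-off

    def flush(self):
        """Persist cached writes to disk."""
        self.disk.update(self.ram_cache)
        self.ram_cache.clear()

    def power_off(self):
        self.online = False
        self.ram_cache.clear()


def write_back(nodes, block, data):
    """Acknowledge a write once every online node holds it in RAM."""
    for node in nodes:
        if node.online:
            node.ram_cache[block] = data


def is_recoverable(nodes, block):
    """True if any node still holds the block, in RAM or on disk."""
    return any(block in node.ram_cache or block in node.disk for node in nodes)


nodes = [Node("node1"), Node("node2")]
write_back(nodes, block=7, data="payload")

nodes[0].power_off()
survives_one_failure = is_recoverable(nodes, 7)   # node2 still has it in RAM

nodes[1].power_off()
survives_total_outage = is_recoverable(nodes, 7)  # nobody flushed in time
print(survives_one_failure, survives_total_outage)
```

If any node calls `flush()` before the outage, the block remains recoverable from its disk, which is the case the paragraph above relies on.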
That would be the highest cache level; the next, intermediate level would usually consist of flash disks. In our case, it would not make sense to use it, since our storage is already SSD. However, this feature is very important when we want to combine more traditional, cost-effective disks with flash disks in order to increase performance (block-level automated tiering):
Finally, we shall indicate that we want a new iSCSI target and create it:
Our next step is to prepare the machine that will use that disk via iSCSI. Although this solution is compatible with other protocols such as SMB, which would be rather useful, for example, when configuring a traditional failover cluster, we shall use iSCSI because it is more generic and the most commonly used for this type of solution. When using iSCSI, the disk appears as a locally attached drive, just like any traditional SCSI disk. The iSCSI initiator is built into all modern operating systems by default, so all we need to do is add the Multipath I/O (MPIO) feature to those already installed:
If we forget this step, or the configuration is not done properly, every disk will appear two or more times, depending on the number of paths available:
iSCSI connections are simple and, in most cases, using the automatic target discovery will ensure that the disk connection will work correctly:
Once the multipath driver is installed and the disk is formatted, we will be able to check that the drive is correctly mounted (without any duplication). In this case, it is assigned the O: drive letter.
Once the configuration has been prepared, our next step is to test its performance. For benchmarking purposes, we will first launch a series of synthetic tests that simulate typical SQL Server patterns against local drives. Then we will launch the same tests while accessing the drive via iSCSI and the SDS solution. Our last step will be to add high availability (synchronous replication) and observe its behaviour. This is similar to first testing a standalone SQL Server installation, then adding a cluster and configuring an availability group with synchronous replication before retesting performance.
Configuring the second node is rather simple, although we must bear in mind that it requires a couple of network cards and that their performance makes an important difference. In our case, we will be using 10 Gbps virtual network cards. Although this is the nominal maximum bandwidth, virtual environments apply QoS policies and bandwidth restrictions that may reduce the bandwidth actually available.
In order to add a replica we shall use the management tool and launch the replication manager:
We will select a different node where the software has been previously installed:
By selecting synchronous replication, we will be able to tolerate the failure of either node which, together with MPIO, makes a failure transparent to the client:
The failover strategy depends on the number of nodes and communication channels available. In the end, this is a similar decision to the one we would make when configuring quorum in a traditional Windows cluster. In our case, we have decided to use HeartBeat, although Node Majority may make more sense in installations with several nodes.
We can either store the replica on an existing device or create a new one. In our case, there is no device on the target yet, so we will create one from scratch:
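The preference for a majority-based strategy only once there are several nodes comes down to simple arithmetic. The sketch below is generic quorum logic, not StarWind's actual failover code, and the function names are our own: a group of nodes keeps serving I/O only if it can see more than half of all nodes.

```python
def has_majority(total_nodes: int, reachable_nodes: int) -> bool:
    """A partition may keep serving I/O only if it sees more than half of all nodes."""
    return reachable_nodes > total_nodes // 2


def tolerated_failures(total_nodes: int) -> int:
    """How many nodes can fail while the survivors still form a majority."""
    return (total_nodes - 1) // 2


for total in (2, 3, 5):
    print(total, "nodes tolerate", tolerated_failures(total), "failure(s)")
```

With only two nodes, a pure majority rule tolerates no failures at all, because a single survivor cannot tell a dead partner from a broken link. That is why a two-node setup like ours relies on an additional heartbeat channel instead.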
When the synchronisation is completed, we will check its status to confirm that both the communications and the HeartBeat channels are online:
In order to carry out a more comprehensive performance analysis, we will also add (once we have completed testing with 2 nodes) a third node to the replica. Again, we will have to be careful when selecting the correct interface for HeartBeat and syncing.
Although most of the results were within our expectations, some were pleasant surprises. Let us begin with random writes. The blue bar below shows that performance for 64 KB random writes at queue depth 1 drops when moving from local to remote access. As the number of replicas increases, performance rises progressively, peaking with 3 replicas (one main and two secondary), which even outperforms the local copy. This speaks volumes about the software’s efficiency and its handling of random writes, since we would have expected a larger write penalty:
In operations with a greater queue depth (depth 8 and requests of 8 and 64 KB), we see better behaviour as soon as the software layer is added, from the very first setup. This indicates that both the cache and the log-structured file system are working properly. Performance drops noticeably, however, when moving from the 2-replica setup to the 3-replica setup. This points to a rather delicate balance between latency and throughput for this type of load, which does not scale well if we add too many nodes to our solution.
Finally, with larger requests of 128 KB, the behaviour is similar to that observed with low queue depth operations. Although there is a significant impact when adding remote access, performance recovers gradually as we increase the number of replicas and even exceeds that of the local drive without any middleware. From the point of view of random write performance and availability, we believe the 3-node solution to be the most suitable.
The following graph shows the behaviour for random reads. In this case, caching, read-ahead techniques and the number of nodes (which increases the overall cache size) produce a general improvement in the 2-node setup. Moving to the 3-node setup does not bring any further improvement: we remain at approximately 360 MB/s, the same as with 2 nodes. This is because we are reaching the throughput limits that the virtual machine allows for the network and the drive. So, if we wanted better performance, we would need to look not only at the number of nodes but also at their “size”. In many cloud environments, there is a correlation between the machine size (CPU, memory, etc.) and the maximum bandwidth (for both the network and the drive) available to it.
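A quick back-of-the-envelope calculation helps to see why the plateau does not move when nodes are added. The sketch below assumes 1 MB = 1024 KB and uses made-up caps purely for illustration. They are not real Azure limits, and both function names are our own.

```python
def iops_needed(target_mb_s: float, request_kb: float) -> float:
    """IOPS required to sustain a throughput at a given request size (1 MB = 1024 KB)."""
    return target_mb_s * 1024 / request_kb


def effective_limit_mb_s(network_cap_mb_s: float, disk_cap_mb_s: float) -> float:
    """The slower of the two per-VM caps bounds what a client can read."""
    return min(network_cap_mb_s, disk_cap_mb_s)


# 360 MB/s is the plateau reported above; request sizes are those mentioned in this post.
for request_kb in (8, 64, 128):
    print(request_kb, "KB ->", iops_needed(360, request_kb), "IOPS")

# Hypothetical caps for illustration only, not real Azure limits.
print(effective_limit_mb_s(network_cap_mb_s=360, disk_cap_mb_s=500))
```

Whatever the number of replicas, the client-side figure cannot exceed the lower of the two caps, so only a larger machine size would raise it.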
When we carry out sequential rather than random operations, increasing the number of nodes makes things worse. The otherwise good performance with 64 KB requests when using the StarWind software drops gradually as we add more nodes. It stands to reason that, in these cases, the cost of communication between nodes hurts performance compared with continuous reads from a single node. Therefore, in an environment dominated by sequential reads (such as a data warehouse), our best option will probably be to reduce the number of replicas.
In the case of sequential writes, we observed very similar behaviour. With sequential operations, we access adjacent parts of the virtual device one after another, which probably leaves the software little to optimise. If, for example, we had 300 blocks to manage on the master node, we would finish writing each block before moving on to the next one, so adding nodes would only increase the cost of syncing between them. This is the opposite of what we saw with random operations, where both the RAM cache and the ability to spread uniformly distributed (because random) requests across different nodes provide performance advantages.
Therefore, our conclusion is that this solution is a very good fit when our typical operations are concurrent and random, for both reads and writes, delivering major performance improvements in almost all scenarios. This could be the case with many small transactional databases or an environment with different types of virtual machines. As for SQL Server, it could be a good complement when migrating a traditional failover cluster with a large number of consolidated databases to the cloud.
The greatest degradation is seen with loads that generate a very low queue depth. An example would be row-by-row insert operations in a cursor or loop within a single database. In this type of situation, disk latency is critical and the number of IOPS achievable at low depth is small, resulting in a performance penalty. From a scaling and resource usage point of view, the solution would not be optimal for massive sequential read and write operations either. For this type of load, which is more typical of a data warehouse, it would be preferable to invest in vertical scaling within our own cluster, adding more disks, CPU, memory, bandwidth, etc. to the same machine. Fortunately, these environments can usually rely on high availability mechanisms with less demanding RPO and RTO than a transactional system, which removes the need for synchronous replication.
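The low queue depth penalty described above follows from a simple relationship (Little's law): the number of operations completed per second cannot exceed the number of outstanding requests divided by the latency of each one. The latencies in this Python sketch are hypothetical values chosen for illustration, not measurements from our tests.

```python
def max_iops(queue_depth: int, latency_ms: float) -> float:
    """Upper bound on IOPS for a given number of outstanding requests and per-request latency."""
    return queue_depth * 1000 / latency_ms


# Hypothetical latencies: a local disk versus a remote, synchronously replicated one.
local_qd1 = max_iops(queue_depth=1, latency_ms=0.5)
remote_qd1 = max_iops(queue_depth=1, latency_ms=2.0)
remote_qd8 = max_iops(queue_depth=8, latency_ms=2.0)
print(local_qd1, remote_qd1, remote_qd8)
```

A row-by-row loop behaves like queue depth 1, so every extra millisecond added by the network and replication translates directly into fewer operations per second, whereas a concurrent workload can hide that latency behind additional outstanding requests.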