It is not uncommon to find a wide range of situations among our customers in terms of virtual machine performance with SQL Server. In many cases, we find situations where performance levels are far from ideal but, in general terms, virtual machines themselves are not to blame. What usually happens is that when we move SQL Server to a virtual machine, we become constrained by a maximum or limited amount of resources (CPU/ memory/ IO) that is significantly different to that of the physical machine.
Usually, we can see that the memory allocations become gradually more reasonable, in addition to a decrease in the previous ballooning issues. In the case of CPU’s, we continue to have some issues with CPU access latency in the case of highly critical loads, making them prone to the “noise” caused by other virtual machines. Generally speaking, we need to understand that the behavior of a CPU core with high average usage cannot be compared with what could be obtained with a “unloaded” core. And this is true for both virtual and physical machines when the CPU is being shared with other processes.
From a CPU point of view, the ideal situation would be having as many SQL Server dedicated cores as possible without “cheating”. By cheating, I mean when SQL Server is only able to effectively use only a low percentage of the physical core, around 20% as a matter of example. It would be best to have a lower number of virtual cores as long as these can be used up to 90% if necessary.
However, we still find that in most cases, the main “issue” is moving from a low disk latency environment with a predictable throughput, etc., towards a higher disk latency environment with a more variable throughput.
In applications involving high transactional loads such as ERPs, these circumstances can cause long delays in critical business processes such as monthly billing. That is why we want to show you how the performance of the same machine with identical CPU and memory can vary significantly depending on the disk performance. Assuming that the memory is enough, read/ write performance in data disks is not usually an issue for OLTP systems. Most of the issues are in the transactions writelog performance levels.
We will, therefore, create 3 identical copies of a TPC-C type database created with HammerDB. Those of who that are already familiar with this tool can download it from the official website (https://www.hammerdb.com/). Basically, what it does is to facilitate the creation of a test database that is similar to the TPC-C in order to launch a series of virtual users that will execute loads against it. It is also capable of creating a database and load similar to TPC-H, so we can also use this tool in order to carry out analytical load tests.
In our example, since this is a small virtual machine, we have only created 4 warehouses. To a large extent, the number of warehouses shall determine whether the loads are well distributed since 90% of the operations done by any users associated with a warehouse are expected to involve their local warehouse. Therefore, the larger the amount of warehouses, the easier it usually is to reach higher TPC values, provided that there is enough memory. In this case, our goal is to demonstrate how disk performance can affect performance while the rest of variables (CPU and memory) remain constant. More specifically, we will be using a VM equipped with 8 vCores and 16GB RAM where we will attach three different types of disks. Our slowest disk will be a RAID 1 array with SAS drives, our second will be an SSD SATA drive, and our third drive will correspond to the high-performance all-flash storage (from a leading brand, according to the Gartner graph):
All three databases will be identical, with the only change being the storage where these will be based. And the names designated for each correspond to their expected levels of performance. I.e., “tortoise” for RAID 1 array with the SAS drive, “kangaroo” for the SSD SATA drive and “cheetah” for the high-end all-flash storage:
We will begin with the tortoise scenario by launching 16 virtual users with no think time, in order to generate the required amount of loads in order to create an overload. During the load, performance levels stabilize at around 41K TPM:
If we check the most significant waits, we will see that we have a significant amount of locks as expected for this scenario. Secondly, we have WRITELOG waits with an average of 17.7ms waiting time. Generally speaking, we would expect average values between 1 and 5ms for the transaction log writes under “normal” circumstances. Or under 1 millisecond in the case of “good” performance levels.
Following, we will proceed to test the kangaroo database with the same setup used in the previous example. In this case, we can see that the performance levels increase almost fivefold, reaching just over 200K TPM.
If we analyse the waits, we can see that these are similarly distributed when compared to the previous case. I.e., we have around 60% of lock waits and 30% of writelog waits:
This improvement in performance levels is due to a decrease in the average waits. In the case of locks, these reduced from 61.5 to 16.3ms, with the writelogs decreasing from 17.7 to 5.7ms. These average write latencies may seem high for an SSD drive, and they actually are. But they are normal in case of device overload. We must also bear in mind that the write costs are much higher than the reads costs for most part of the SSDs.
And finally, we will launch the same test against the cheetah database. In this case, we can see the performance levels rise to 1.1M TPM, multiplying the SSD Sata drive’s performance by 5 and the magnetic SAS drive’s performance by 25 times. So we have 25 times’ additional performance by simply moving from a slow to a fast disk. A fact that should be somewhat “enlightening” when it comes to deciding where to invest in order to achieve good performance levels in these types of loads:
In this case, if we analyse the waits, we can see that there are several differences. We continue to see higher lock waits, representing almost 50% of the total waits. However, the average lock waits have decreased from 16.3ms with the SSD drive to just 2.2ms on average (7.4 times shorter). Furthermore, the writelog waits have been reduced from 30 to 7.29%, with a reduction in the average wait from 5.6 to 0.1ms (56 times shorter):
By using this ultra fast disk technology, we can get rid of the bottlenecks. So in our quest to achieve further concurrency and performance, instead of the disk being the primary cause of these high lock durations that indirectly affect the transaction times and, thereby the lock durations, we are now mainly held by the lock manager itself, the row storage (versus in-memory, for example) and the lack of extra CPU cores (or faster cores) overloaded by the fact that the VM is also using the same CPU.
We can also appreciate the importance of a fast IO in other tasks, such as an index REBUILD. Although the rebuild has an important CPU component, it is also susceptible to disk performance levels. If we launch a rebuild on a slow disk, we will see that it takes up to 35 seconds with a cold cache and 28 seconds with a warm cache.
use [tortoise] GO DBCC DROPCLEANBUFFERS GO -- Cold data ALTER INDEX [order_line_i1] ON [dbo].[order_line] REBUILD GO --CPU time = 11594 ms, elapsed time = 34786 ms -- Hot data ALTER INDEX [order_line_i1] ON [dbo].[order_line] REBUILD GO --CPU time = 11735 ms, elapsed time = 28423 ms.
In the SSD SATA drive database, although we saw an improvement in these, it was less than expected, going from 35 to 27 seconds with a cold cache and from 28 to 23 seconds with a warm cache:
use [kangaroo] GO DBCC DROPCLEANBUFFERS GO -- Cold data ALTER INDEX [order_line_i1] ON [dbo].[order_line] REBUILD GO -- CPU time = 11360 ms, elapsed time = 26986 ms. -- Hot data ALTER INDEX [order_line_i1] ON [dbo].[order_line] REBUILD GO --CPU time = 11594 ms, elapsed time = 22609 ms.
Finally, faster storage provides not only better latencies but also an improved throughput, resulting in a reduction in the rebuild times that decreased to 10 seconds, thereby obtaining almost the same times for cold or warm cache.
USE [cheetah] GO DBCC DROPCLEANBUFFERS GO -- Cold data ALTER INDEX [order_line_i1] ON [dbo].[order_line] REBUILD GO --CPU time = 9858 ms, elapsed time = 10363 ms. -- Hot data ALTER INDEX [order_line_i1] ON [dbo].[order_line] REBUILD GO --CPU time = 9522 ms, elapsed time = 9597 ms.
We can see the differences clearly on the following graph:
In conclusion, before migrating our physical services to virtual machines, we highly recommend considering their features. And although virtual machines will inevitably add a certain amount of penalties when accessing a resource, this fact alone is not a good enough reason to completely rule them out. There are also significant advantages that come along with a virtual environment so, except in very specific situations, we believe that this is a possibility that must be analysed. In this article, we have seen how the perceived performance levels for specific SQL loads depends significantly on the disk input/ output performance. So, with the same CPU and memory in a virtual machine, we can obtain highly variable levels of performance.
Latest posts by Rubén Garrigós (see all)
- Not all virtual machines are the same - May 29, 2019
- Network latencies over 1ms. Are they enough for a good SQL server performance? - April 10, 2019
- Latency, the worst enemy for any Hybrid Cloud environment - March 13, 2019