As storage gets faster and local SSD storage becomes more widespread, disk access times are steadily decreasing. Perhaps the best example is Intel's Optane SSDs, notable for read/write latencies far lower than those of traditional SSDs, in addition to being connected directly through the PCIe bus:

As the specifications above show, a typical read latency is ~10 µs, i.e., 0.01 ms. In this context, and whenever the SQL operations do not need to read from or write to disk, the longest delay our server experiences will stay below 1 ms. In fact, until recently, a predominant ASYNC_NETWORK_IO wait was usually a "very good symptom", since it meant there were no significant waits on the usual bottlenecks. This is gradually changing: we are starting to find environments where network performance has become the main bottleneck and, therefore, the main reason behind the slowness experienced by users (particularly business users) or applications (especially ETLs or similar processes).
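To put those orders of magnitude side by side, here is a quick back-of-the-envelope calculation; the latency figures are illustrative assumptions, not measurements from this post:

```python
# Illustrative, assumed latency figures (not measurements from this post):
ssd_read_s = 10e-6        # ~10 us per Optane-class SSD read
network_rtt_s = 500e-6    # ~0.5 ms round trip on a typical LAN

rows = 100_000            # an RBAR process issuing one round trip per row

storage_total_s = rows * ssd_read_s     # time spent waiting on storage
network_total_s = rows * network_rtt_s  # time spent waiting on the network

print(f"storage wait: {storage_total_s:.1f} s")   # ~1 s
print(f"network wait: {network_total_s:.1f} s")   # ~50 s
```

Even a modest 100,000-iteration row-by-row process turns a "negligible" half-millisecond round trip into tens of seconds of pure network wait.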

Those of you who follow this blog regularly will remember that we have already discussed this issue in other posts (RBAR: 'Row By Agonizing Row' [in Spanish] and Latency, the worst enemy for any Hybrid Cloud environment). On this occasion, however, we will deal with applications that require ultra-low latencies, as well as processes that are especially prone to latency increases due to their RBAR design. In these cases, every microsecond counts, so we must optimize our setup at the BIOS, hypervisor, OS, and network levels in order to obtain optimal performance.

From a network point of view, we continue to see that many companies are unfortunately still stuck on 1 Gbps circuitry. In some cases there are 2x or 4x port teams/aggregations intended to reach 2-4 Gbps. The trouble with these aggregations is that the extra bandwidth cannot always be used in a straightforward manner: although they provide higher availability, increased total bandwidth, etc., on most occasions each individual TCP connection will be limited to the performance of a single 1 Gbps port. Even in companies where 10 Gbps circuitry is already available, we may still run into other limitations, such as the firewall's filtering performance or a "legacy" switch that connects certain clients at 1 Gbps or even 100 Mbps.
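As a rough illustration of why the per-stream link speed matters, the sketch below (with an assumed transfer size and a hypothetical `transfer_seconds` helper) estimates single-stream transfer times:

```python
# Assumed sizes and a hypothetical helper. A 4x1 Gbps team gives ~4 Gbps
# aggregate, but each TCP connection usually hashes onto ONE member link,
# so a single stream still sees ~1 Gbps.
def transfer_seconds(size_bytes, link_gbps, efficiency=0.9):
    """Estimated time to move size_bytes over one TCP stream at ~90% efficiency."""
    return size_bytes * 8 / (link_gbps * 1e9 * efficiency)

size = 10 * 1024**3  # a 10 GiB extract
print(f"one stream on a 1 Gbps link:  {transfer_seconds(size, 1):.0f} s")
print(f"one stream on a 10 Gbps link: {transfer_seconds(size, 10):.0f} s")
```

A single stream moves the same extract roughly ten times faster on the 10 Gbps link, regardless of how many 1 Gbps ports are teamed.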

The maximum bandwidth is often tied to the switching and routing speed as well, so we will be limited not only in throughput but also in the end-to-end latency that can be obtained. For that reason, we will carry out several RBAR-focused tests with loops and local cursors on different processors, in order to find out how they behave once the network sits in between. In addition, we will test virtual versus physical machines, and virtual machines affinitized to the same host versus machines located on different hosts, to better understand the impact of each decision. We will repeat every test three times and show the average values obtained.
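The measurement methodology (run each test three times, report the average) can be sketched as a small harness; the Python version below is an illustrative stand-in, not the actual tooling used for these tests:

```python
import time

def avg_duration(fn, repeats=3):
    """Run fn `repeats` times and return the mean wall-clock duration."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)

# Example: time a tight single-threaded loop, a stand-in for an RBAR workload.
print(f"avg: {avg_duration(lambda: sum(range(1_000_000))):.4f} s")
```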

But first, we will obtain a synthetic reference of the single-thread performance that will dominate at the CPU level in these cursor/loop-oriented processes. We can see substantial differences between machines depending on the processor generation, and on whether they are desktop or server processors:

If we also plot the maximum operating frequencies, we can clearly see a correlation, showing that we cannot replace a high-frequency core with "several" lower-frequency cores without consequences:

It is crucial to bear these factors in mind, since it is not uncommon for processes developed and tested on desktop machines with high-frequency processors to show much longer durations than expected when they arrive at QA or go directly into production. In other words, don't expect the production server to run faster than your laptop or desktop machine; most of the time it will be the opposite.

Next, we will analyze the performance of two typical cases that are closely tied to single-thread performance in T-SQL: on the one hand, the loops required by many joins, key lookups, etc.; on the other, an "empty" iterating cursor, in order to measure the row-by-row overhead of the cursor structure itself.

To that end, we will use a couple of simple scripts: first to simulate single-threaded loops, and then a forward-only cursor:

-- Loop join test, 1 billion rows
set statistics time on
select count_big(*)
from (select top 1000000000 t1.object_id
      from sys.objects t1, sys.objects t2, sys.objects t3,
           sys.objects t4, sys.objects t5, sys.objects t6) a
option (maxdop 1)
-- Cursor test, only 10 million rows
SET NOCOUNT ON
set statistics time off
DECLARE test_cursor CURSOR
READ_ONLY
FOR select top 10000000 t1.object_id
    from sys.objects t1, sys.objects t2, sys.objects t3,
         sys.objects t4, sys.objects t5, sys.objects t6
    option (maxdop 1)
 
DECLARE @object_id int
OPEN test_cursor
 
FETCH NEXT FROM test_cursor INTO @object_id
WHILE (@@fetch_status <> -1)
BEGIN
       IF (@@fetch_status <> -2)
       BEGIN
               set @object_id = @object_id -- deliberate no-op: we only measure the cursor's row-by-row overhead
       END
       FETCH NEXT FROM test_cursor INTO @object_id
END
 
CLOSE test_cursor
DEALLOCATE test_cursor
GO

These scripts will be executed locally; for now the network is not involved at all, since we simply want to measure the processors' capacity and the differences between physical and virtual machines. We will start with the loop, where we again see an inverse correlation, in this case between maximum frequency and duration:

In these types of operations, frequency makes the most significant difference, followed by a relatively small penalty when using virtual machines (5-10%).

Below are the results obtained from the cursor test, where you can once more see the inverse correlation between maximum frequency and duration:

In the cursor test, the spread between configurations, from the fastest physical machine to the slowest virtual one, is significantly larger. Comparing the extremes, the times increased threefold, whereas in the loop scenario the worst case did not even double (although it came close). The virtualization penalty alone in these cursor-based loads can reach up to 30% in a server environment.

It is worth remembering that cursors should be the last alternative to consider in a set-oriented language such as T-SQL. Unfortunately, we still see them used fairly frequently in OLTP environments to implement processes that reuse existing logic; for example, a month-end procedure that launches a stored procedure 100,000 times serially in order to process invoices one by one.
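As a language-neutral illustration of that cost pattern, the Python toy below contrasts a per-row call (one `process_invoice` call per invoice, a hypothetical stand-in for the stored procedure) with a single set-based operation:

```python
# Toy illustration: calling a routine once per item pays its fixed
# invocation overhead N times, while a set-based approach pays it once.
def process_invoice(total, amount):
    return total + amount          # per-invoice work (trivial here)

def process_rbar(amounts):
    total = 0
    for a in amounts:              # one "call" per row
        total = process_invoice(total, a)
    return total

def process_set_based(amounts):
    return sum(amounts)            # one operation over the whole set

amounts = list(range(100_000))
assert process_rbar(amounts) == process_set_based(amounts)  # same result
```

The results are identical; only the number of round trips (and therefore the accumulated fixed overhead) differs.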

Next, we will carry out similar tests, this time including a network component. We will use only the two most extreme scenarios, since otherwise we would end up with far too many combinations. To prevent loading a grid in Management Studio, or dumping to disk, from artificially slowing down the process, we will configure the client to discard any results it receives (the "Discard results after execution" query option in Management Studio):

Instead of summing the 1 billion rows on the server, we will send them to the client, as though we intended to do the calculation on the client's side. In the case of the cursor, instead of a simple variable SET, we will return the variable's value to the client with a SELECT statement:

-- Loop join test client-server, 1 billion rows
set statistics time on
select top 1000000000 t1.object_id
from sys.objects t1, sys.objects t2, sys.objects t3,
     sys.objects t4, sys.objects t5, sys.objects t6
option (maxdop 1)

-- Cursor test, only 10 million rows
SET NOCOUNT ON
set statistics time off
DECLARE test_cursor CURSOR
READ_ONLY
FOR select top 10000000 t1.object_id
    from sys.objects t1, sys.objects t2, sys.objects t3,
         sys.objects t4, sys.objects t5, sys.objects t6
    option (maxdop 1)
 
DECLARE @object_id int
OPEN test_cursor
 
FETCH NEXT FROM test_cursor INTO @object_id
WHILE (@@fetch_status <> -1)
BEGIN
       IF (@@fetch_status <> -2)
       BEGIN
              select @object_id -- returns a one-row resultset to the client on each iteration
       END
       FETCH NEXT FROM test_cursor INTO @object_id
END
 
CLOSE test_cursor
DEALLOCATE test_cursor
GO

We can see that these changes significantly increased the average execution time in all scenarios compared with the previous tests, due to the client-server communication now being involved. However, if we analyze the data, we find some curious facts confirming that performance depends on many factors, including the order of those factors.

If we analyze the durations in the first test, where we returned 1 billion rows between client and server, we will find the following:

From these results, we can draw several conclusions:

  • Using an Infiniband network instead of 1 Gbps Ethernet reduces execution times by approximately 10%.
  • Durations seem to be more strongly conditioned by the client's speed when "absorbing" the rows SQL Server generates: we obtain lower times when the faster processor acts as the client (whether physical or virtual).
  • When client and server run on the same host (whether physical or virtual), the results are strongly influenced by that processor's performance, so moving the client onto the server is not always the best option if the server itself is not fast enough compared to the client.

If we analyze the results in the second test, where we returned 10 million rows in 10 million resultsets between client and server, we will find the following:

From these results, we can draw several conclusions:

  • With a smaller number of rows (from 1 billion down to 10 million), Infiniband still improves on 1 Gbps Ethernet, but the reduction drops from the previous 10% to only 2%.
  • Duration is still strongly conditioned by the client's performance, and the operation is clearly slower (around 600 seconds as opposed to 350) when the client runs on a slow processor.
  • Again, when client and server run on the same host (whether physical or virtual), the results are strongly influenced by the processor's performance, so moving the client onto the server is not always the best option if the server itself is not fast enough.

Another factor we can analyze is the impact of running the client on a virtual rather than a physical machine (same processor, only virtualized). In all cases there is a penalty, although to a different degree. More specifically, the operation involving 1 billion rows in a single resultset, with a fast server and a slow client, saw the largest penalty: approximately 0.03 µs per row (30 extra seconds on top of the total time). When the server is comparatively slower than the client, this virtualization penalty drops to 0.0025 µs per row (~2.5 extra seconds in total).
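The per-row figures follow directly from the extra totals observed over the 1-billion-row test:

```python
# Derive the per-row virtualization penalty from the extra total times above.
rows = 1_000_000_000
extra_s_fast_server = 30      # ~30 s of extra time with a fast server, slow client
extra_s_slow_server = 2.5     # ~2.5 s of extra time with a comparatively slow server

per_row_us_fast = extra_s_fast_server / rows * 1e6   # seconds -> microseconds
per_row_us_slow = extra_s_slow_server / rows * 1e6

print(f"{per_row_us_fast:.2f} us per row")    # 0.03 us per row
print(f"{per_row_us_slow:.4f} us per row")    # 0.0025 us per row
```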

However, when dealing with a high number of resultsets, the opposite is true: less time is added to each row when the server is fast and the client slow than in the opposite case. The added latency is very similar in both the 1 Gbps Ethernet and the Infiniband cases, leading us to believe it is a result of network interface virtualization.

In view of these results, it might seem that high-speed adaptors are unjustified, since the virtualization impact is similar and, in the best scenarios, we only achieve a 10% improvement. However, there are many cases where they are clearly much faster, such as with larger row sizes and volumes (a typical situation in DW environments). As a reference, we will launch a modified version of the first query that returns 8 KB per row, for a total of ~8 GB of data:

-- 1 million rows, 8 KB per row = ~8 GB of data
set statistics time on
select top 1000000 convert(char(8000), 'a')
from sys.objects t1, sys.objects t2, sys.objects t3,
     sys.objects t4, sys.objects t5, sys.objects t6

Looking at the speeds obtained, the 1 Gbps card is clearly a bottleneck. With Infiniband, on the other hand, the performance of the client, the data consumer, is decisive: when the slow processor consumes the data and the fast one sends it, we do not reach 8 Gbps, whereas when the fast processor consumes the data, we exceed 18 Gbps:
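For context, the raw transfer-time arithmetic for ~8 GB at these speeds (protocol overhead ignored):

```python
# ~8 GB of data, in bits (1 million rows x ~8,000 bytes, rounded to 8 GB)
data_bits = 8e9 * 8

def seconds_at(gbps):
    """Minimum time just to move the data at the given line rate."""
    return data_bits / (gbps * 1e9)

print(f"at  1 Gbps: {seconds_at(1):.0f} s")    # the 1 Gbps card sets a 64 s floor
print(f"at 18 Gbps: {seconds_at(18):.1f} s")   # feasible only if the client keeps up
```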

In conclusion, several factors can affect the performance of a row-by-row process. Network latency is only one of them, and matters most when it is high. When it is low, we must remember that additional latency will come from virtualization, or from differences in the "consumed rows per second" ratio arising from each processor's single-thread performance.

Depending on the type of load, we can benefit from the proximity between client and server by locating both on the same host or even the same virtual machine (VM affinity). However, this is not always the best strategy, since both would then share host resources; performance may be better with two machines, each dedicated to its specific client or server role. For example, in a client role we may only need a small amount of memory and a few cores on "steroids" in terms of operating frequency.

In scenarios with a high number of iterations (loops, cursor rows), a client/server round trip, and a low computational cost per iteration, our tests show that client performance affects total times more than server performance does. It is therefore crucial to measure any "dead" time between cursor iterations caused by the client. In these setups it is not uncommon to find that a very high percentage of the total time corresponds to the per-row send delay, so it may be wiser to invest in improving the client hardware (middleware, application servers, AOS, etc.) rather than the database server's own hardware.
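A simple per-iteration model illustrates this; the per-row costs below are assumptions chosen for illustration, not measurements:

```python
# Model: total time = rows x (server work + client work + network delay),
# with assumed per-row costs where client-side processing dominates.
def total_seconds(rows, server_us, client_us, network_us):
    per_row_us = server_us + client_us + network_us
    return rows * per_row_us / 1e6

rows = 10_000_000
t = total_seconds(rows, server_us=2, client_us=25, network_us=8)
client_share = 25 / (2 + 25 + 8)   # fraction of each iteration spent in the client

print(f"total: {t:.0f} s, client share: {client_share:.0%}")
```

Under these assumptions the client accounts for over 70% of the total time, so halving server time would barely move the needle, while a faster client would.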

Finally, although the use of higher-performance network interfaces and infrastructure (Infiniband, 40/100 Gbps Ethernet) provides only relatively moderate advantages over correctly working 1 Gbps Ethernet in an OLTP environment, the gains can become very large in a DW environment with higher data volumes, where 1 Gbps (or even 10 Gbps) Ethernet can be a considerable bottleneck.

Rubén Garrigós

Rubén Garrigós is an expert in high-availability enterprise solutions based on SQL Server design, tuning, and troubleshooting. Over the past fifteen years, he has worked with Microsoft data access technologies in leading companies around the world. He is currently a Microsoft SQL Server and .NET applications architect with SolidQ. Ruben is certified by Microsoft as a Solution Expert on the Microsoft Data Platform (MCSE: Data Platform) and as a Solution Expert on the Microsoft Private Cloud (MCSE: Private Cloud). As a Microsoft Certified Trainer (MCT), Ruben has taught multiple official Microsoft courses as well as other courses specializing in SQL Server. He has also presented sessions at official events for various Microsoft technologies user groups.