Over the last few years, we have increasingly been finding hybrid environments in which some SQL Server instances are migrated to the cloud, while other applications, services, ERPs or even SQL Server instances remain on-premises in the original data center. This means that any connection between the two environments is constrained by bandwidth and higher latency, in contrast to connections that stay within a single environment.
In this blog we have already discussed this issue in the past with regard to row-oriented operations, where each row incurs a network latency penalty (https://blogs.solidq.com/es/dotnet_es/row-by-agonizing-row-rbar-y-la-latencia-de-red/ in Spanish). However, there are other situations where, even though we are working with set-oriented operations, the total performance is lower than expected. Before we start, we will use iPerf3 to check the “raw” performance we get over a local connection:
In our initial scenario of local network latency, we reach just over 800 Mbps on average, with a 953 Mbps peak; these are fairly typical values for a 1 Gbps local network. We will now use the Clumsy tool (https://jagt.github.io/clumsy/download.html) to add network latencies from 1 to 100 ms:
If we run the same iPerf3 test, we see that the speed achievable with a single connection in our test environment falls abruptly as latency increases. No TCP or transport-level optimisation has been carried out on this network; we are using the default Windows Server 2016 TCP stack settings:
This data can be compared with public reference data such as the following (https://docs.microsoft.com/en-us/azure/expressroute/expressroute-troubleshooting-network-performance#latencybandwidth-results) to see that the curve is similar, with very abrupt drops even at small latencies:
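One way to see why the drop is so sharp: with a single TCP connection, throughput is roughly bounded by the window size divided by the round-trip time. A minimal sketch of that bound (the 64 KB figure is simply the classic maximum TCP window without window scaling, used here as an illustrative assumption; real stacks, including Windows Server 2016, scale the window):

```python
# Rough single-connection TCP throughput ceiling: window / RTT.
# 64 KB is the classic maximum window without window scaling; treat
# this as an illustrative worst-case scenario, not a measurement.

def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on throughput for one TCP stream, in Mbps."""
    rtt_s = rtt_ms / 1000.0
    return (window_bytes * 8) / rtt_s / 1_000_000

for rtt in (1, 10, 50, 100):
    print(f"RTT {rtt:3d} ms -> <= {max_throughput_mbps(64 * 1024, rtt):7.1f} Mbps")
```

Even with this simplified model, the shape of the curve is the same as in the iPerf3 tests: a few tens of milliseconds of latency are enough to drop a single stream far below the 1 Gbps line rate.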
If we compare the times for a SQL Server query to move a certain amount of data over the network, we see that the total time also increases with latency. For example, this query, which moves 5 MB of data, can take anywhere from 40 ms to 383 ms depending on the latency:
-- 5 MB
select top 5000 convert(char(1000), '') a
from sys.objects s1, sys.objects s2, sys.objects s3, sys.objects s4, sys.objects s5
In absolute terms, a few milliseconds may not seem like much. But in the context of a 1 Gbps network, where we can move over 100 MB of data per second, we would expect to move 5 MB in about one twentieth of a second (50 ms). What would happen if we moved 5 GB instead of 5 MB? Scaling linearly, instead of 40 seconds we would have to wait 383 seconds (6 minutes and 23 seconds), which already causes significant deviations in high-frequency ETL loads. And if instead of 5 GB we moved 50 GB (which is not so unreasonable, based on some clients’ workloads), we would be talking about going from 6 minutes and 40 seconds to over 1 hour.
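The figures above are a linear extrapolation of the measured 5 MB timings; a quick calculation reproduces them:

```python
# Linearly extrapolate the measured 5 MB transfer times (40 ms on the
# low-latency link, 383 ms on the high-latency one) to larger volumes.

MEASURED_MS = {"low latency": 40, "high latency": 383}  # time to move 5 MB

def transfer_seconds(volume_mb: float, ms_per_5mb: float) -> float:
    """Extrapolated transfer time, assuming time scales with volume."""
    return (volume_mb / 5) * ms_per_5mb / 1000

for label, ms in MEASURED_MS.items():
    for volume_mb in (5_000, 50_000):  # 5 GB and 50 GB
        secs = transfer_seconds(volume_mb, ms)
        print(f"{label}: {volume_mb / 1000:.0f} GB -> {secs / 60:.1f} min")
```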
Therefore, if we need to move large volumes of data over the network, we must establish not only the available bandwidth but also the latency to be expected between both ends. Depending on that latency, if we want to make the most of the available bandwidth, we may have to redesign our processes to be multi-threaded and combine the results back together at the destination.
For example, if we need a full month of data, we could easily divide the load into days, carry out N extractions, one per day, and then join the N result sets. It would not even be unreasonable to use one thread per day, giving us between 28 and 31 concurrent threads depending on the month.
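A minimal sketch of this split-by-day pattern, with a stub standing in for the real database call (in practice `extract_day` would run the per-day filtered query against the source SQL Server; here it just returns a dummy row so the orchestration is runnable):

```python
import calendar
from concurrent.futures import ThreadPoolExecutor

def extract_day(year: int, month: int, day: int) -> list:
    """Stub for the per-day extraction. A real implementation would run
    the query with a per-day filter against the source server over its
    own connection and return the rows."""
    return [(year, month, day)]  # dummy payload: one "row" per day

def extract_month(year: int, month: int) -> list:
    days = calendar.monthrange(year, month)[1]  # 28..31
    # One concurrent extraction per day of the month.
    with ThreadPoolExecutor(max_workers=days) as pool:
        futures = [pool.submit(extract_day, year, month, d)
                   for d in range(1, days + 1)]
        rows = []
        for f in futures:
            rows.extend(f.result())  # join the partial results
    return rows

print(len(extract_month(2018, 1)))  # 31 partial extractions combined
```

Each thread opens its own connection, so each one gets its own TCP stream, which is precisely what lets the aggregate transfer approach the available bandwidth despite the per-stream latency ceiling.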
This would be really easy to implement by simply adding an extra filter to our query, such as “where datepart(day, columnDate) = 1, 2, 3, (…), 31” (one value per thread). Although this implementation may be useful in many cases, it will not work so well in others. When extracting data from a data warehouse, these systems normally have a physical design optimized for date ranges, so it would probably be a better idea to use a filter over a smaller range: “where columnDate >= ‘20180101’ and columnDate < ‘20180102’”. In either case, knowing the data model is a great help, since we might have, for example, partitioning by weeks or some other time period that we should take into account when selecting the ideal ranges.
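Generating those half-open range predicates for each thread is straightforward; a sketch (the column name `columnDate` is taken from the filter above and is only a placeholder):

```python
import calendar
from datetime import date, timedelta

def day_range_predicates(year: int, month: int, column: str = "columnDate"):
    """One half-open [day, next day) predicate per day of the month,
    matching the sargable range-filter style discussed above."""
    days = calendar.monthrange(year, month)[1]
    preds = []
    for d in range(1, days + 1):
        start = date(year, month, d)
        end = start + timedelta(days=1)  # half-open: < next day
        preds.append(f"where {column} >= '{start:%Y%m%d}' "
                     f"and {column} < '{end:%Y%m%d}'")
    return preds

preds = day_range_predicates(2018, 1)
print(preds[0])  # where columnDate >= '20180101' and columnDate < '20180102'
```

Using half-open ranges (>= start, < next day) avoids both boundary duplicates between threads and any dependency on the time portion of the column.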
What happens if there are no date columns available to split the data? In that case, our first option would be to use any field with enough variability to generate a fair split using a modulo function. For example, an auto-number column would be an ideal candidate, allowing a filter like “where (id % 10) = 0, 1, 2, 3, (…), 8, 9” (one value per thread). If we don’t have any such column, we could use a text column that is as “unique” as possible for each row. For example, on a national ID number column we could compute a CHECKSUM and apply the modulo on top of it: “where ABS(CHECKSUM(dni)) % 10 = 0, 1, 2, 3, (…), 8, 9”.
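The fairness of this kind of split can be checked outside the database; in this sketch zlib.crc32 stands in for SQL Server’s CHECKSUM (they are different functions, this only illustrates that a hash-plus-modulo split spreads distinct values fairly evenly across buckets):

```python
import zlib
from collections import Counter

def bucket(text: str, buckets: int = 10) -> int:
    """Analogue of ABS(CHECKSUM(col)) % 10: hash the value and take the
    modulo, routing each row to one of N extraction threads."""
    return zlib.crc32(text.encode("utf-8")) % buckets

# Simulate 10,000 distinct "national ID"-style values.
counts = Counter(bucket(f"ID-{i:08d}") for i in range(10_000))
print(sorted(counts.values()))  # roughly 1,000 rows per bucket
```

Note that the split is deterministic: re-running a thread with the same filter returns the same rows, which is what makes partial load retries possible with this approach.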
Finally, if we have none of the above, as in the earlier 5 MB query that returns only blank strings, we could use NewId() as a random source: “where ABS(CHECKSUM(NewId())) % 10 = 0, 1, 2, 3, (…), 8, 9”. However, this approach would prevent, for example, implementing “partial load retries”, since each execution would return different rows for each thread.
select top 5000 convert(char(1000), '') a, ABS(CHECKSUM(NewId())) % 10 filtro
from sys.objects s1, sys.objects s2, sys.objects s3, sys.objects s4, sys.objects s5
Last but not least, in cases where the query execution cost is significant (e.g., a complex view, large sorts, etc.), the system may well be unable to handle 10 concurrent requests, making the situation worse by saturating the data source. In these cases, it may be worth materializing the global result first, for example into a staging or global temporary table at the source (to avoid the network impact), and then having each thread read from that flat table with its corresponding filter.
Summing up, whenever we are dealing with a hybrid environment, we must carry out tests that simulate the real network characteristics. Unfortunately, in general, only a simple bandwidth comparison is done, and anything above the “current average use” is accepted as valid. Even though the impact of latency is well understood, we still need to analyze the most suitable mitigation technique on a case-by-case basis. Failing to do so can have a major impact when a large number of applications, ETLs, etc. are involved.