As you know, as long as you are not totally oblivious to the technological world you will have heard about one of the biggest bugs in the history of computer science (Spectre and Meltdown) and that its effects are real. So real, that we ourselves at SolidQ ourselves have experienced it in our own Query Analytics software. In this post I will try to shed some light on how to proceed if you detect performance regression in your solution with SQL Server, explaining how I have solved it in my own system.
TSQL Query Analytics is a software that works transparently for our clients, so that if you are a client of this tool, what you have is a system that is responsible for processing all the queries that arrive to your SQL Server system and adds them to a Datawarehouse, which you can then analyze graphically, it has a proactive alert system, …
TSQL Query Analytics is hosted in Azure and obviously has a series of databases behind it, which you as a good user of the tool would want to use:
The problem came with the surprise of the bug and its subsequent patches. Microsoft released an official statement on January 3 informing that the patches were already applied to all its Azure infrastructure. My surprise came when I came back after my holidays and saw how some of certain clients´reports were starting to go suspiciously slow … so much so, that even on some occasion I even saw timeouts …
Query timeout at powerbi.com
- Increase Azure machine capacity until performance returns to an acceptable level
- Software engineer to squeeze the capabilities of the current machine
Of course, one of the great advantages of being in Azure is that with just 2 clicks I could have increased the capabilities of the machine … but that easy solution has a monetary cost, that if you are reading this post you want to avoid. Having said that, and as a good SolidQ Mentor, I set out to do with my software, what I usually do with clients that hire me: “optimize the solution to lower costs while giving the best possible performance”. Needless to say that in SolidQ, we have TSQL Query Analytics analyzing itself, which is a great advantage in this case since it allows us to see not only the performance impacts, but what is causing them. Let’s see the results given by the tool, on its own performance (our users using the dashboards, that is):
Evolution of average query times (ms)
Let’s look at the average times from January 2nd. That day, I see a SQL Server instance restart looking at the event viewer and it’s the day prior to the Microsoft press release so it’s obvious that that restart is the fix … and moreover, everything fits. Just look at how much the times are increasing.
Before this, on January 9, I spent an intense morning analyzing the possible solutions that could be applied to sufficiently improve performance, making it perform as it did before. Finding the solution in this case was quite easy for me … as easy as using the same TSQL Query Analytics and going to the “Comparison: Reliability” tab.
In this tab we can see numerically the real impact of the performance regression suffered. For this, I simply compared January 3 (post patch) with December 27 (the previous Wednesday) to have the most accurate comparison possible. Let’s take a closer look at this comparison:
With this information, I can draw the following conclusions. For the same number of queries (~ 2.4% difference):
- 11.6% CPU cost increase
- this is what the extra machine suffers by serving the same queries
- 64.89% increase in duration
- This is what the user perceives, the response time
What I proposed therefore was to lower the I/O operations as much as possible, assuming that all these are neither more nor less than a lot of low level signals, which are also full of advanced operations to try to obtain the data that the engine thinks that you will need later, highly parallelizable, multithread … well, something that would be noticed. That said, what I did was just go to the tab “TOP 10 Query pattern weight (Reads/Writes)”
And I dedicated myself to optimize the top 3 queries for both reads and writes. The optimizations applied depended on the scenario … some were to modify indexing, others were to change some type of data, some were TVP function rewrites, … but whatever it was, the idea I want you to take away is that optimization always consisted of DECREASING THE SYSTME’S I/O LOAD. What caused the decrease in I/O?
An obvious decrease in Mb/s after my code optimizations
This ultimately resulted in a decrease in CPU usage
I/O decrease involves a decrease in CPU cycles
That led to the query response time returning to pre-fix times, and even better ones:
Decreased CPU cycles directly decrease average query duration times
Obviously I do not have inside knowledge on what kind of impact the patches have, but keep in mind that they entail significant performance regressions. In my scenario, minimizing I/O operations was something I imagined to be very affected …such as easily parallelizable, multithreading and low level operations (read this excellent post by Chema Alonso to understand why I reached that conclusion).
Finally, just remember that the effects of the patches on the Spectre and Meltdown bugs depend on your operational load. In my case, it is clearly related to large volumes of queries to I/O, but it may not be your case. Whatever it maybe, remember that the implications can be quite serious and that you forewarned is forearmed.
- Debugging applications against production data: obfuscation and GDPR - November 22, 2019
- SQL Server performance with Spectre and Meltdown patches - January 18, 2018
- Data Migration Assistant Helper - September 18, 2017