This post is also available in: esEspañol (Spanish)

As you know, as long as you are not totally oblivious to the technological world you will have heard about one of the biggest bugs in the history of computer science (Spectre and Meltdown) and that its effects are real. So real, that we ourselves at SolidQ ourselves have experienced it in our own Query Analytics software. In this post I will try to shed some light on how to proceed if you detect performance regression in your solution with SQL Server, explaining how I have solved it in my own system.

 UPDATE: Intel confirms performance regression and also that the major impact is in I/O

TSQL Query Analytics is a software that works transparently for our clients, so that if you are a client of this tool, what you have is a system that is responsible for processing all the queries that arrive to your SQL Server system and adds them to a Datawarehouse, which you can then analyze graphically, it has a proactive alert system, …

TSQL Query Analytics is hosted in Azure and obviously has a series of databases behind it, which you as a good user of the tool would want to use:

Dashboard analyzing 2 bilion queries while inserting 2420 queries/sec

The problem came with the surprise of the bug and its subsequent patches. Microsoft released an official statement on January 3 informing that the patches were already applied to all its Azure infrastructure. My surprise came when I came back after my holidays and saw how some of certain clients´reports were starting to go suspiciously slow … so much so, that even on some occasion I even saw timeouts …

Query timeout at powerbi.com

Which is obviously not a matter of pure chance and then I thought that the bugfix applied would have something to do with it. Given this situation, I had 2 options:

  1. Increase Azure machine capacity until performance returns to an acceptable level
  2. Software engineer to squeeze the capabilities of the current machine

Of course, one of the great advantages of being in Azure is that with just 2 clicks I could have increased the capabilities of the machine … but that easy solution has a monetary cost, that if you are reading this post you want to avoid. Having said that, and as a good SolidQ Mentor, I set out to do with my software, what I usually do with clients that hire me: “optimize the solution to lower costs while giving the best possible performance”. Needless to say that in SolidQ, we have TSQL Query Analytics analyzing itself, which is a great advantage in this case since it allows us to see not only the performance impacts, but what is causing them. Let’s see the results given by the tool, on its own performance (our users using the dashboards, that is):

 

Evolution of average query times (ms)

Evolution of average query times (ms)

Let’s look at the average times from January 2nd. That day, I see a SQL Server instance restart looking at the event viewer and it’s the day prior to the Microsoft press release so it’s obvious that that restart is the fix … and moreover, everything fits. Just look at how much the times are increasing.

Practically a 60% drop in performance

CPU evolution average queries (ms)

Remember that we are seeing times of a system that is showing analysis results on 2 billion queries

Before this, on January 9, I spent an intense morning analyzing the possible solutions that could be applied to sufficiently improve performance, making it perform as it did before. Finding the solution in this case was quite easy for me … as easy as using the same TSQL Query Analytics and going to the “Comparison: Reliability” tab.

Full data of comparison reliability sheet

In this tab we can see numerically the real impact of the performance regression suffered. For this, I simply compared January 3 (post patch) with December 27 (the previous Wednesday) to have the most accurate comparison possible. Let’s take a closer look at this comparison:

Performance regression measured pre and post fix

With this information, I can draw the following conclusions. For the same number of queries (~ 2.4% difference):

  1. 11.6% CPU cost increase
    • this is what the extra machine suffers by serving the same queries
  2. 64.89% increase in duration
    • This is what the user perceives, the response time
Performance regression of 64.89% … next to nothing

What I proposed therefore was to lower the I/O operations as much as possible, assuming that all these are neither more nor less than a lot of low level signals, which are also full of advanced operations to try to obtain the data that the engine thinks that you will need later, highly parallelizable, multithread … well, something that would be noticed. That said, what I did was just go to the tab “TOP 10 Query pattern weight (Reads/Writes)

Query performance analysis

And I dedicated myself to optimize the top 3 queries for both reads and writes. The optimizations applied depended on the scenario … some were to modify indexing, others were to change some type of data, some were TVP function rewrites, … but whatever it was, the idea I want you to take away is that optimization always consisted of DECREASING THE SYSTME’S I/O LOAD. What caused the decrease in I/O?

An obvious decrease in Mb/s after my code optimizations

Notice how despite the increase in CPU usage and duration, the average Mb/s is about the same before and after January 3

This ultimately resulted in a decrease in CPU usage

I/O decrease involves a decrease in CPU cycles

That led to the query response time returning to pre-fix times, and even better ones:

Decreased CPU cycles directly decrease average query duration times

Obviously I do not have inside knowledge on what kind of impact the patches have, but keep in mind that they entail significant performance regressions. In my scenario, minimizing I/O operations was something I imagined to be very affected …such as easily parallelizable, multithreading and low level operations (read this excellent post by Chema Alonso to understand why I reached that conclusion).

Finally, just remember that the effects of the patches on the Spectre and Meltdown bugs depend on your operational load. In my case, it is clearly related to large volumes of queries to I/O, but it may not be your case. Whatever it maybe, remember that the implications can be quite serious and that you forewarned is forearmed.

Enrique Catalá

Enrique Catalá es Mentor en SolidQ, Microsoft Data Platform MVP desde 2013 e Ingeniero en informática. Es Microsoft Certified Trainer (MCT) y está centrado en el motor relacional SQL Server, tanto en la resolución de problemas de rendimiento y escalabilidad en sistemas OLTP como la definición e implementación de entornos de alta disponibilidad confiables, en donde ha llevado con éxito más de 100 proyectos no solo en España, sino en diferentes países como EEUU, Holanda, México, Arabia Saudí, Austria... Es el arquitecto principal de las soluciones para SolidQ llamadas HealthCheck, SQL2Cloud, SCODA y del generador de SSIS de SolidQ. Ponente habitual del SolidQ SUMMIT, miembro y ponente en SQL PASS tanto en España como Iberoamérica ponente en varios SQLSaturday