The purpose of this article is to expand on the already familiar White Paper, “Integration Services: Performance Tuning Techniques”, written by Elizabeth Vitt and Hitachi Consulting and published on TechNet in 2006. Some of the concepts in that White Paper will be reproduced here, both in the interests of clarity and to link the two documents, thus enriching the reader’s knowledge of SQL Server Integration Services’ inner mechanisms.

The evolution of SQL Server Integration Services performance concepts will be discussed, focusing on the new engine features that aim to improve parallelism and how they affect the various situations analyzed in the aforementioned White Paper. In addition, buffer structure and behavior will be reviewed, to give a better idea of how they fit into this new performance scenario.

“Aim at perfection in everything, though in most things it is unattainable. However, they who aim at it, and persevere, will come much nearer to it than those whose laziness and despondency make them give it up as unattainable”

Lord Chesterfield

“Good things, when short, are twice as good. And even the bad, if short, is not so bad”

Baltasar Gracián y Morales


 

It is human nature to wish to improve, and in an over-competitive world, every action – every step – should aim at perfection.

SQL Server Integration Services is no exception to the rule, and this article will discuss the internal components of the SSIS run-time engine, how to modify them, and the way in which these changes can affect the performance of our package runs.

It is important to understand what SSIS buffers are, and how to deal with them when it comes to accelerating or optimizing a package’s performance. Execution trees are fundamental to understanding how Integration Services organizes data flows, and therefore how data is distributed along the pipes. Combining these elements will lead to specific examples of parallelism and design patterns.

Buffers and execution trees

What is a buffer?

A buffer is a delimited memory zone used by SSIS to hold the data it is dealing with during the execution of a package. Once the package design is complete, a buffer’s metadata (fields, data types, etc.) is fixed and cannot be changed at execution time; its content, however, can be. Manipulating the data inside the buffers is the job of SSIS transformations.

Since SSIS buffer addresses are fixed in memory, using them is extremely fast: there is no need either to copy information between buffers or to re-allocate the buffers themselves.

Buffer usage

Use of buffers by SSIS transformation type

  • Row-by-row transformations: Rows are processed as they enter the component, so there is no need to accumulate data. Because these components can reuse the buffers created by preceding components, there is no need to create new buffers and copy data into them. Examples: Data Conversion, Lookup, Derived Column.
  • Partially blocking transformations: These are usually used to combine data sets. Since there is more than one data input, huge numbers of rows may sit in memory, waiting for the other data set to reach the component. In these cases, the component’s output is copied to new buffers, and new execution threads may be created. Examples: Union All, Merge Join.
  • Fully blocking transformations: Some transformations need the complete data set before they start running, so these are the ones that impact performance the most. In these cases, too, new buffers and new execution threads are created. Examples: Aggregate, Sort.
SSIS reuses previously used buffers as much as possible, in order to increase performance.

Row-by-row transformations are known as synchronous: each input row produces exactly one output row. Partially blocking and fully blocking transformations, by contrast, are known as asynchronous: the number of output rows need not match the number of input rows (they may even produce no output rows at all).
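The contrast can be pictured with a small sketch. This is plain Python, not SSIS internals; the two functions and their field names are hypothetical, but they show the key difference: a synchronous transform works on the buffer it receives, while an asynchronous one must accumulate its input and emit its output into a brand-new buffer.

```python
def derived_column(buffer, func):
    """Synchronous (row-by-row): mutates rows in the buffer it receives;
    one output row per input row, and no new buffer is allocated."""
    for row in buffer:
        row["upper_name"] = func(row["name"])
    return buffer  # same object: no copy was made

def aggregate_count(buffer, key):
    """Asynchronous (fully blocking): needs the whole input before it can
    emit anything, and its output goes into a brand-new buffer."""
    counts = {}
    for row in buffer:
        counts[row[key]] = counts.get(row[key], 0) + 1
    return [{key: k, "count": v} for k, v in counts.items()]  # new buffer

rows = [{"name": "ann"}, {"name": "bob"}, {"name": "ann"}]
same = derived_column(rows, str.upper)
assert same is rows                  # in-place: buffer identity preserved
agg = aggregate_count(rows, "name")  # fewer output rows than input rows
```

Note that `aggregate_count` returns two rows for three inputs, illustrating why asynchronous components break the one-in, one-out relationship.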

From execution trees to data paths

An execution tree is defined as a section of a data flow that starts at an asynchronous component and ends at a destination, or at a transformation with no synchronous outputs.

During a package execution, the SSIS engine divides the dataflow into execution trees, and these define the way buffers and execution threads are created and managed throughout the package. Each of these trees forces the creation of at least one new buffer and could create a new execution thread.

A new buffer requires extra memory to deal with the transformation that it is associated with. It is important to note that the creation of a new execution tree DOES NOT always involve a new execution thread.

Figure 1 – Execution tree schema

In this schema we have:

  • Tree 1: It is created when reading rows from “Customer Source”, and keeps them in memory until they are loaded in “Customer Error Destination”
  • Tree 2: Works in the same way as Tree 1, but refers to “Customer Source 2” and “Customer Error Destination 2”
  • Tree 3: Reads “Customer Source” output buffer rows, copies them into new buffers and makes them available to the “Union All” component
  • Tree 4: Works in the same way as Tree 3, but for “Customer Source 2” and “Union All”
  • Tree 5: Copies rows into its new buffers and executes “Union All”, “Data Conversion”, “Derived Column” and “Aggregate” components. All these operations share buffers that are reused when they are not being used by previous asynchronous operations.
  • Tree 6: Once the “Aggregate” component has finished, its output rows are copied into “Tree 6” new buffers and loaded into “Customers Final Destination”

In SQL Server 2008, execution trees were renamed paths, and their behavior basically remains the same. The difference is that now, inside each path, subpaths can be created, taking advantage of the inherent parallelism between the transformations they include. Each of these subpaths can be executed in parallel. This is a big step forward for developers, who no longer have to design for concurrent execution by hand. Nevertheless, it is still important to understand how these mechanisms are defined, in order to take advantage of them and avoid certain problems, as we will see during the article.

These subpaths are created by SSIS automatically, after analyzing whether transformations inside a path can be parallelized.

In this example we have a data flow that would imply two execution trees in SSIS 2005, but from SSIS 2008 onwards implies 2 paths and 2 subpaths. Paths are equivalent to the execution trees we already know:

  • Path 0: From “OLE DB Source” to “Union All”
  • Path 1: From “Union All” to “OLE DB Destination”

Now, for Path 0 we have 2 subpaths:

  • Subpath 0: From “Multicast” to “Union All”, including “Derived Column”
  • Subpath 1: From “Multicast” to “Union All”, including “Derived Column 1”

Figure 2 – Package including subpaths

By adding the PipelineExecutionTrees event to the SSIS log, it is possible to display paths and subpaths.

Figure 3 – Paths and subpaths in the event log

The log shows how 2 subpaths are created inside Path 0, taking advantage of the possible parallelism present in that section of our data flow.

Buffer sizes

Buffer size is determined by the SSIS engine, which will always try to maximize performance, subject to the values of certain configurable properties and to its internal limits:

  • DefaultBufferMaxRows: This is a data flow property. It is configurable, and SSIS sets it to 10,000 by default. SSIS multiplies the estimated row size by DefaultBufferMaxRows to estimate how big a buffer would be.
  • DefaultBufferSize: Again, a data flow property. It is set to 10 MB by default. It can be increased to try to improve performance, but there is an upper threshold of 100 MB, known as MaxBufferSize, which cannot be changed.

  • MinBufferSize: Like MaxBufferSize, this is an internal limit: SSIS’s lower threshold for buffer size. It is determined by the operating system’s virtual memory allocation granularity, and is usually 65,536 bytes, although that may vary between machines.
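The interaction between these properties can be sketched with a little arithmetic. This is a simplified model of the sizing rules described above, not the engine’s actual algorithm; the function name and the row-reduction strategy are assumptions for illustration, while the constants mirror the documented defaults.

```python
MAX_BUFFER_SIZE = 100 * 1024 * 1024   # 100 MB hard upper threshold
MIN_BUFFER_SIZE = 64 * 1024           # typical lower threshold (OS-dependent)

def buffer_rows(estimated_row_size,
                default_buffer_max_rows=10_000,
                default_buffer_size=10 * 1024 * 1024):
    """Estimate how many rows would fit into one buffer under the
    documented sizing rules (a sketch, not SSIS's real algorithm)."""
    target = min(default_buffer_size, MAX_BUFFER_SIZE)
    estimated = estimated_row_size * default_buffer_max_rows
    if estimated > target:
        # Buffer would exceed the target size: cut the row count to fit.
        return target // estimated_row_size
    if estimated < MIN_BUFFER_SIZE:
        # Buffer would fall below the lower threshold: grow the row count.
        return MIN_BUFFER_SIZE // estimated_row_size
    return default_buffer_max_rows

# A 5,000-byte row cannot fit 10,000 times into a 10 MB buffer,
# so the engine would reduce the row count:
wide_rows = buffer_rows(5_000)    # → 2,097 rows per buffer
narrow_rows = buffer_rows(100)    # fits comfortably → 10,000 rows
```

This is exactly the lever mentioned below: with wide rows, lowering the estimated memory per buffer (or the row cap) changes how many rows travel together.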

Changing and combining these properties and parameters may improve performance or change packages’ behavior in some way. For instance, reducing the number of rows in each buffer – and therefore its size – could avoid memory pressure issues when dealing with partially-blocking transformations.

By adding the BufferSizeTuning event to our event log, we can observe how the SSIS engine adjusts buffer sizes depending on the data sources, the aforementioned properties and parameters, and the run-time conditions.

Figure 4 – Buffer tuning using the event log

Parallelism in SSIS

We have seen how SSIS deals with buffers and how, through paths, they are related to execution threads. Now we will discover how to generate new threads and to maximize their potential benefits.

Configuring parallelism properties

The first SSIS property fundamental to this matter is located at the control flow level: MaxConcurrentExecutables. It specifies how many executables (tasks) can run in parallel within a package. By default, it is set to -1, which stands for “the number of logical processors of the machine plus 2”. The second property is located at the data flow level and is called EngineThreads. It defines how many threads are allowed to execute inside that particular data flow.

MaxConcurrentExecutables acts as an upper bound on execution. Even if the sum of the EngineThreads values across the package’s data flows is 10, if MaxConcurrentExecutables is set to 6 and there are already 6 threads running inside the package, the SSIS engine will prevent the creation of new ones, regardless of how threads are distributed across data flows.

By modifying these properties we can increase the concurrency of some data flows which are most susceptible to being parallelized, and constrain others where extra threads would not be useful. Moreover, depending on our execution environment, we can configure the behavior of the engine to adapt it to our resources. We may like to reduce the threads that SSIS packages spawn in a non-dedicated server, or in a single-core machine, for example.
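The upper-bound behavior of MaxConcurrentExecutables can be pictured as a counting semaphore: however many tasks are ready, at most N run at once. The following is a plain-Python analogy, not SSIS code; the constant and function names are invented for the illustration.

```python
import threading
import time

MAX_CONCURRENT = 2          # plays the role of MaxConcurrentExecutables
running = 0
peak = 0
lock = threading.Lock()
gate = threading.Semaphore(MAX_CONCURRENT)

def executable(i):
    """One 'executable': blocks at the gate while the cap is reached."""
    global running, peak
    with gate:
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)    # simulate some work
        with lock:
            running -= 1

# Six tasks are ready, but the gate never lets more than two run at once.
threads = [threading.Thread(target=executable, args=(i,)) for i in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert peak <= MAX_CONCURRENT
```

Raising the cap lets more ready tasks overlap; lowering it throttles the package, which is what we do on a non-dedicated server.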

Design patterns

As we previously discussed when addressing buffer configuration decisions, we have to choose where and how we want to favor parallelism. For a single case study, we will analyze the effects that each option would produce.

Option A: Parallelizing destinations

Figure 5 – Parallelizing destinations

In Figure 5, both the data source and the aggregate transformation share one single path, and therefore the same buffers and threads; they are not parallel operations. However, the insertion into the different destinations is executed in different subpaths. How beneficial this is will depend on where we are inserting data (partitioned tables, indexes present, disk workload, etc.).

Option B: Partially parallelized operations

Figure 6 – Partially parallelized operations

If we follow this pattern, the results we obtain will depend on the SSIS version we are working with.

If we were using SSIS 2005, we would have exactly the same situation as in Figure 5: none of the aggregate operations is truly parallel, because all of them are carried out within the execution tree created with the data source. To force parallelism, we could introduce a partially blocking operation, such as a Union All transformation, before the aggregate operations. That would force a new execution tree, with its own buffers and threads, which would deal with the aggregate transformations in parallel, although the associated cost of creating new buffers and copying data into them should be taken into account.

From SSIS 2008 onwards, we get parallel execution without introducing any extra elements, as a result of the engine enhancements previously discussed. If, having added the PipelineExecutionTree event, we check the event log, we will see something like this:

Figure 7 – Native parallelism starting at SSIS 2008

Four subpaths are generated without our intervention, so the aggregate operations can be executed in different threads, each of them reading from Multicast’s output buffers.

Option C: Fully parallelized

Figure 8 – Fully parallelized

If our execution environment has enough resources, this option results in great performance. However, we could be wasting resources due to the parallel read of the exact same data. In other words, we would be repeating work unnecessarily.

Detecting bottlenecks through various tools (event logs, performance counters, etc.), combined with our knowledge of the data we are working with, will be extremely helpful when designing data flows for the best performance, thus helping SSIS use its available resources. In Figure 8, performance was increased because we knew we could split our data into ranges.

The key is not just to improve, through specific actions, the performance of a package that already exists, but also to create an optimal design from the very early stages of development: from data sources to destinations, passing through efficient task design, data flow configuration and buffer usage.

Hands on: practical examples

As we have seen, in SSIS 2005 we can force the generation of a new execution tree by using a Union All transformation, even when it is not necessary for the package’s data processing. This may be useful if the processes downstream of the Union All component are susceptible to parallelization. Otherwise, the cost of generating new buffers and copying data into them could be higher than any benefit we obtain. These potential benefits can be a double-edged sword, as the following examples show.

Figure 9 – Example using Union All

This package generates 10 million rows, caches a very small table with the Lookup transformation, and applies a Union All component before sending the rows to a junk destination. The Data Conversion component is there purely for demonstration purposes; it is empty. On my local machine, this package took 2 minutes and 59 seconds to run.

Figure 10 – Execution time with Union All

When the Union All component is removed, ignoring no-match rows on the Lookup component and connecting the Derived Column output directly to Junk Destination, the package runs in 2 minutes 30 seconds on my laptop:

Figure 11 – Execution time without Union All

 

The slow example took 20% longer than the fast one, which did not have the Union All component. This penalty is due to the extra CPU pressure of generating the new path, buffers and threads and, especially, of copying data between the output buffers and the new input buffers. Here we lose one of the greatest benefits of working with SSIS, which is (as discussed before) the data’s fixed position in memory; without it, performance decreases notably.
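The two access patterns behind that penalty can be sketched as follows. This is a toy Python comparison, not SSIS code; the function and field names are invented, and it only illustrates the shape of the cost, not its magnitude.

```python
def transform_in_place(buffer):
    """Synchronous pattern: rows stay where they are in memory."""
    for row in buffer:
        row["doubled"] = row["value"] * 2
    return buffer               # no allocation, no copy

def transform_with_copy(buffer):
    """Forced-new-buffer pattern: every row is copied, as a Union All
    inserted into the flow would force."""
    new_buffer = []
    for row in buffer:
        new_row = dict(row)     # every field is copied to the new buffer
        new_row["doubled"] = new_row["value"] * 2
        new_buffer.append(new_row)
    return new_buffer           # extra memory plus per-row copy cost

data = [{"value": i} for i in range(5)]
a = transform_in_place([dict(r) for r in data])
b = transform_with_copy(data)
assert a == b                   # same result, very different cost profile
```

Both paths produce identical output; the difference is purely the allocation and copying work, which is exactly what the Union All added to the slow package.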

Adding the PipelineExecutionTree event to the event log in SSIS 2008 and later versions, we can see that there are no subpaths hanging from the main path. This means that SSIS has analyzed the operations and decided that parallelism would bring no benefit in this case.

Figure 12 – Subpath analysis without Union All

The components that follow the Union All will not take advantage of the effects of a new path and, therefore, the associated costs will outweigh any benefit it could add.

Conclusions

As seen above, it is very useful to monitor what is happening in our SSIS packages, since this will help us to fully understand what is being affected by our package design. When working with noisy environments, it could be useful to build a testing structure in which we will be able to monitor the phenomena occurring in the SSIS’ engine core more clearly. Having only our own processes running in the environment will make the analysis easier, since it will exclude external issues that could distort our conclusions (e.g. other processes blocking tables into which we are writing data).

Besides the aforementioned properties and processes, there are many elements that can influence Integration Services performance. Network traffic, physical disk status, etc., are also factors to keep in mind: a complete performance analysis and redesign should include an overall system vision and planning.

As in most performance-related discussions, testing in our specific scenarios is essential.

References

The Integration Services Performance Tuning Techniques Guide

Pau Sempere