Lately i was working on project where the current ETL process was approaching 8 hours run time. There were a few critical tables to had over 100,000,000 rows to load that were bogging down the system. The entire process was done by a ton of SQL Stored Procedures that were simply executed by an SSIS package.

One of the tricks i used was to figure out how to only apply changes and not do a full load each night. The source data was not setup for auto change detection and did not have a modified date i could use. This was a head turner.

The solution was to move lots of the code from SQL into SSIS. The following discussion breaks down various features that were used.

The first thing i tackled was a massive SQL statement that had a ton of joins (lets say close to 20). I broke this up by taking almost all of the joins out and moving those to SSIS lookups. This allowed the original retrieval of data to become quite fast.

There were a couple of areas where i could operate on the smaller data and make decisions at that point. This would limit the amount of rows from further down the pipe.

Next, i went through the current SQL code that worked on the tables i was working on. I found a ton of Update statements that would process the entire table and some Updates that would use a Where clause. I moved all of this code in either Script code or Derived Column. Since the data was in memory at this point, the new code works very fast and eliminated many expensive SQL operations.

Now, how could we detect the changes to rows? The solution was to calculate an SHA hash value for each row and store it in a new table.  The field name is ChangeHashSHA and the row is keyed with an ID to the source table row. The length of the ChangeHashSHA field is 88 and its a string. This field was added to the incoming data using a Derived Column Transformation.

First i created a script method to calculate the hash value for a field:

Public Function CreateSHAHash(ByVal Input As String) As String
 Dim HashTool As New System.Security.Cryptography.SHA512Managed()
 Dim inputBytes As Byte() = System.Text.Encoding.ASCII.GetBytes(Input)
 Dim hashBytes As Byte() = HashTool.ComputeHash(inputBytes) ' Convert the byte array to hexadecimal string
HashTool.Clear()
Dim myMetadata As IDTSComponentMetaData100
 myMetadata = Me.ComponentMetaData
 Return Convert.ToBase64String(hashBytes)
 End Function

Next i created a method to add the value to the main value:

Private Sub AddItemToUpdate(ByVal ColumnValue As String)
 UpdateTestValue &= ColumnValue.Trim().Replace(" ", "").Replace(":", "").Replace("-", "")
 End Sub

The next code comes from the main script method (Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)). This

If Row.ClaimAcceptedDateConverted_IsNull = False Then
 AddItemToUpdate(Row.ClaimAcceptedDateConverted.ToString())
 End If
'Now add the new hash to ChangeHash
 Row.ChangeHashSHA = CreateSHAHash(UpdateTestValue)
 Row.ChangeDate = Me.Variables.BuildDateTime

There is an If block for each field in the source table. You can see how the call to CreateSHAHash will pass back the new SHA has for the row.

Now that we have a hash for the row, we can lookup the stored hash (if there is one) for that ID and test them. If they match, we drop this row as it has not changed. If they don’t match we do an update of the source and the new hash value and if the lookup fails, we do an insert and also insert the new hash.

One other feature we used here was the MultiCast transform. The original load would do an insert into the History table and then after all the processing was finished a Select Into another table. Instead of doing this, i used the MultiCast to split the incoming stream and make sure each record was going to the 2nd table and then do an insert. This totally eliminated an expensive SQL process (Select Into).

The next post will dig into another are of this application.

 

Latest posts by Ken Spencer (see all)