A Data Warehouse is a key component of a BI project, and it requires an ETL process to feed it from different sources. SQL Server includes a tool called Integration Services (SSIS) that greatly helps to carry out these ETL processes. In this article, we will tour an SSIS project, discussing the key features, reviewing the design and output, and showing the most commonly used components.
Firstly, I want to clarify that this article is based on the premise that our BI project will be carried out specifically with the tools that SQL Server provides. The aim of this article is to focus on SQL Server, its tools, and my experiences with them. Readers who need more conceptual detail will find plenty of information available about the design, construction, and loading of a Data Warehouse; I recommend the publications of two gurus in this field: Ralph Kimball and Bill Inmon.
In most BI projects we have a key component: the Data Warehouse or Data Mart. It stores the data on which the users’ queries are centered. It also serves as the data source for loading Analysis Services cubes and for feeding reporting and analysis tools.
We are now going to deal with the implementation of ETL processes using Integration Services, starting from the premise that our data sources will be relational databases (SQL Server, Oracle, MySQL or another database), and that we will have a Staging Area and a Data Mart, both being databases stored in SQL Server. We will follow some guidelines that we have been compiling based on our experience – Figure 1.
What is Integration Services?
Integration Services is a tool that was released with SQL Server 2005, taking a huge step forward compared to the tools provided in previous versions. This is the definition given by Microsoft for this tool:
“Microsoft Integration Services is a platform for building enterprise-level data integration and data transformations solutions. You use Integration Services to solve complex business problems by copying or downloading files, sending e-mail messages in response to events, updating data warehouses, cleaning and mining data, and managing SQL Server objects and data. The packages can work alone or in concert with other packages to address complex business needs. Integration Services can extract and transform data from a wide variety of sources such as XML data files, flat files, and relational data sources, and then load the data into one or more destinations.
Integration Services includes a rich set of built-in tasks and transformations; tools for constructing packages; and the Integration Services service for running and managing packages. You can use the graphical Integration Services tools to create solutions without writing a single line of code; or you can program the extensive Integration Services object model to create packages programmatically and code custom tasks and other package objects.”
As this definition mentions, SSIS is not only an ETL tool; it also provides a set of administration tasks that DBAs will use.
I would like to stress that these kinds of tools are very powerful. Development with them is very fast, and they allow us to create several development and implementation processes within short periods of time. It is very simple to create a new package that migrates and transforms data to a new destination and creates dependencies, even without a good plan covering the global aim. For that reason, I consider these tools to be double-edged swords: if we use them properly, they can help us a lot to generate processes, coordinate their execution, and manage the chains of dependencies among them. But using them without a prior design and analysis can create big problems; even touching the simplest process could become very difficult, because of its dependencies on and implications for other processes, whose normal running it could disrupt. The expression “a picture is worth a thousand words” applies well here – Figure 2.
I can attest that Integration Services is a very complete tool, with lots of features, but with some risks if we do not use it properly.
Integration Services as ETL Tool
Let us focus on SSIS features as an ETL tool and especially on the use of its data flows, studying the particular case in which the sources as well as the destination are relational databases. We are going to use Integration Services to implement a set of processes of Extraction, Transformation, Cleaning and Loading of data in our Data Mart, going through a Staging Area.
Load of the Staging Area
Let’s start by giving a set of recommendations and some guidelines that we should bear in mind every time we deal with a data load, and for which it will be very helpful to have a staging area:
Minimum impact on the source: to achieve maximum scalability, and to affect the performance of the transactional servers as little as possible, we should consume as few of their resources as we can, and concentrate resource consumption on the server that carries out the ETL processes.
Data traceability: it is very important to know who made every change, and where and when. For this purpose, we must store every change that we detect while carrying out the extraction processes and obtaining the differences between source and destination, so that we can track any piece of data and its changes.
Creation of Delta tables: the best alternative is always to have a delta table that registers the insertions, updates and deletions that have been performed in the source, and to apply those changes to the destination. This is more efficient and, of course, consumes fewer resources in both source and destination than reading the whole source, comparing it to the destination, and deriving the differences to apply. Some systems can create these tables natively; if ours cannot, we can always generate them in our ETL processes.
Data cleansing: another point to bear in mind is that, in these processes, the aim is not only to bring the data to the destination and apply some transformations. It is very important to carry out a set of data cleansing and anomaly detection tasks: for example, we may receive a sale for an item code that is not yet in our system. We can correct the data that we know is wrong, or simply keep it out of the destination and leave it somewhere for someone to review.
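The delta-table idea above can be sketched in plain SQL. The snippet below uses Python with sqlite3 only so it runs anywhere; the table and trigger names (Items, Items_Delta) are invented for the example, and in a real project the delta would be populated in the source system or by the ETL process itself:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Items (ItemCode TEXT PRIMARY KEY, Name TEXT);

-- Delta table: one row per operation (I = insert, U = update, D = delete)
CREATE TABLE Items_Delta (
    DeltaId   INTEGER PRIMARY KEY AUTOINCREMENT,
    Operation TEXT NOT NULL,
    ItemCode  TEXT NOT NULL,
    Name      TEXT,
    ChangedAt TEXT DEFAULT (datetime('now'))
);

-- Triggers register every change made to the source table
CREATE TRIGGER trg_items_ins AFTER INSERT ON Items BEGIN
    INSERT INTO Items_Delta (Operation, ItemCode, Name)
    VALUES ('I', NEW.ItemCode, NEW.Name);
END;
CREATE TRIGGER trg_items_upd AFTER UPDATE ON Items BEGIN
    INSERT INTO Items_Delta (Operation, ItemCode, Name)
    VALUES ('U', NEW.ItemCode, NEW.Name);
END;
CREATE TRIGGER trg_items_del AFTER DELETE ON Items BEGIN
    INSERT INTO Items_Delta (Operation, ItemCode, Name)
    VALUES ('D', OLD.ItemCode, OLD.Name);
END;
""")

con.execute("INSERT INTO Items VALUES ('A1', 'Widget')")
con.execute("UPDATE Items SET Name = 'Widget v2' WHERE ItemCode = 'A1'")
con.execute("DELETE FROM Items WHERE ItemCode = 'A1'")

ops = [row[0] for row in con.execute(
    "SELECT Operation FROM Items_Delta ORDER BY DeltaId")]
print(ops)  # ['I', 'U', 'D']
```

Applying the delta to the destination then becomes a matter of replaying the operations in DeltaId order, instead of rescanning and comparing the whole source table.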
We are going to see briefly, through a series of images, an example of loading a table in the staging area – see Figures 3, 4 & 5 below:
In this case, we have a column in the data source that indicates the date of the last modification of each row, and we should use it in the design of our process. This means dividing the process into two blocks: one that carries out the insertions and modifications, and another that handles deletions, so that rows which were not read from the source (simply because they have not been modified since our last load) are not mistakenly treated as deleted.
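The two-block logic can be sketched as follows. This is an illustrative sketch in sqlite3, not the SSIS data flow itself; the Source and Staging tables and the LastModified column are stand-ins for whatever the real source exposes:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Source  (ItemCode TEXT PRIMARY KEY, Name TEXT, LastModified TEXT);
CREATE TABLE Staging (ItemCode TEXT PRIMARY KEY, Name TEXT, LastModified TEXT);
""")
con.executemany("INSERT INTO Source VALUES (?,?,?)", [
    ("A1", "Widget", "2012-01-05"),
    ("A2", "Gadget", "2012-02-10"),
])
# A row that was deleted in the source but still sits in staging
con.execute("INSERT INTO Staging VALUES ('A3', 'Gone', '2011-12-01')")

last_load = "2011-12-31"

# Block 1: insertions and modifications -- only rows changed since last load
con.execute("""
    INSERT OR REPLACE INTO Staging
    SELECT ItemCode, Name, LastModified FROM Source
    WHERE LastModified > ?""", (last_load,))

# Block 2: deletions -- compare the full key sets, never LastModified,
# so unmodified (and therefore unread) rows are not mistaken for deleted ones
con.execute("""
    DELETE FROM Staging
    WHERE ItemCode NOT IN (SELECT ItemCode FROM Source)""")

rows = sorted(r[0] for r in con.execute("SELECT ItemCode FROM Staging"))
print(rows)  # ['A1', 'A2']
```

Note that the deletion block reads only the key column from the source, which keeps the impact on the transactional server low, in line with the guidelines above.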
Data Mart Load
Before going into these processes, let’s review briefly a series of key concepts that we should take into account in the design of our Data Marts or Data Warehouses, as well as in the implementation of the ETL processes:
SCD (Slowly Changing Dimensions): slowly changing dimensions provide different ways of registering data modifications. When changes take place, we may want to keep a record of the modifications or simply overwrite the previous values. There are several strategies for addressing this, but we are going to talk about the two most commonly used:
SCD type 1, overwrite: this is the most basic type, and it consists simply of overwriting the dimension, without keeping any record of the modifications. We only keep the last version.
SCD type 2, add rows: this type keeps a record of the changes by generating a new row for each version. For that purpose, some additional columns must be added: columns that register the validity period of each version, and a key to identify each one (the Surrogate Key), given that the source key (Business Key) is no longer unique because it recurs in every stored version.
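An SCD type 2 change can be sketched with two statements: close the current version, then open a new one. The sketch below uses sqlite3 for portability, with an invented DimItem dimension; SSIS would typically implement this inside a data flow:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE DimItem (
    ItemSK    INTEGER PRIMARY KEY AUTOINCREMENT,  -- Surrogate Key
    ItemCode  TEXT NOT NULL,                      -- Business Key (not unique)
    Name      TEXT,
    ValidFrom TEXT NOT NULL,
    ValidTo   TEXT                                -- NULL marks the current version
)""")
con.execute("""INSERT INTO DimItem (ItemCode, Name, ValidFrom)
               VALUES ('A1', 'Widget', '2012-01-01')""")

def scd2_update(con, item_code, new_name, change_date):
    """Close the current version and open a new one (SCD type 2)."""
    con.execute("""UPDATE DimItem SET ValidTo = ?
                   WHERE ItemCode = ? AND ValidTo IS NULL""",
                (change_date, item_code))
    con.execute("""INSERT INTO DimItem (ItemCode, Name, ValidFrom)
                   VALUES (?, ?, ?)""", (item_code, new_name, change_date))

scd2_update(con, "A1", "Widget v2", "2012-06-01")

versions = list(con.execute(
    "SELECT ItemSK, Name, ValidFrom, ValidTo FROM DimItem ORDER BY ItemSK"))
print(versions)
# [(1, 'Widget', '2012-01-01', '2012-06-01'), (2, 'Widget v2', '2012-06-01', None)]
```

Both versions share the Business Key 'A1', which is why the Surrogate Key ItemSK is needed to identify each row uniquely.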
Lookups: a lookup consists of accessing a table through a key in order to obtain other columns from that table for the queried key. When loading the fact tables, given that what we obtain from the source are the Business Keys (for example, the code of a sold item), we will have to look up, for each Business Key, its corresponding Surrogate Key in the item dimension (based on when the fact occurred).
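The date-aware lookup can be sketched as a query that matches the Business Key and picks the version whose validity period covers the date of the fact. Again this is an illustrative sqlite3 sketch with invented names, standing in for the SSIS Lookup component:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE DimItem (
    ItemSK INTEGER, ItemCode TEXT, ValidFrom TEXT, ValidTo TEXT)""")
# Two SCD type 2 versions of the same Business Key
con.executemany("INSERT INTO DimItem VALUES (?,?,?,?)", [
    (1, "A1", "2012-01-01", "2012-06-01"),
    (2, "A1", "2012-06-01", None),
])

def lookup_sk(con, business_key, fact_date):
    """Return the Surrogate Key whose validity period covers the fact date."""
    row = con.execute("""
        SELECT ItemSK FROM DimItem
        WHERE ItemCode = ?
          AND ValidFrom <= ?
          AND (ValidTo IS NULL OR ? < ValidTo)""",
        (business_key, fact_date, fact_date)).fetchone()
    return row[0] if row else None

print(lookup_sk(con, "A1", "2012-03-15"))  # 1 -> the first version
print(lookup_sk(con, "A1", "2012-07-01"))  # 2 -> the current version
```

A fact dated before the change resolves to the old Surrogate Key, so historical sales stay attached to the version of the item that was valid when they occurred.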
Inferred Members: sometimes we receive keys that refer to rows of other tables that do not yet exist, because of integrity problems or simply because of the timing with which the data arrives from the two tables. For example, the sale of an item that is not yet in the items table may arrive, either because there is no declarative referential integrity in the database, or because we have a distributed system and, although we have received the sale, the arrival of the item is still pending. In such a case, and in order to keep all the sales rows, we should create a row in the items table with only the data we have available, mark it as unfinished, and complete the pending columns when we finally receive the item.
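The inferred-member pattern can be sketched as a lookup that, on a miss, inserts a placeholder row flagged for later completion. The table, column and function names below are invented for illustration (sqlite3 again stands in for the real databases):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE DimItem (
    ItemSK     INTEGER PRIMARY KEY AUTOINCREMENT,
    ItemCode   TEXT UNIQUE,
    Name       TEXT,
    IsInferred INTEGER DEFAULT 0)""")
con.execute("INSERT INTO DimItem (ItemCode, Name) VALUES ('A1', 'Widget')")

def get_or_infer_sk(con, item_code):
    """Look the item up; if the key is missing, create an inferred member
    with only the Business Key, flagged for later completion."""
    row = con.execute("SELECT ItemSK FROM DimItem WHERE ItemCode = ?",
                      (item_code,)).fetchone()
    if row:
        return row[0]
    cur = con.execute(
        "INSERT INTO DimItem (ItemCode, IsInferred) VALUES (?, 1)",
        (item_code,))
    return cur.lastrowid

sk_known    = get_or_infer_sk(con, "A1")  # existing item
sk_inferred = get_or_infer_sk(con, "B7")  # sale arrived before the item
flag = con.execute(
    "SELECT IsInferred FROM DimItem WHERE ItemCode = 'B7'").fetchone()[0]
print(sk_known, sk_inferred, flag)  # 1 2 1
```

When the item 'B7' finally arrives, the dimension load finds the flagged row, fills in the missing columns, and clears the IsInferred flag.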
We are going to see briefly, through a series of images, an example of loading a dimension table in the Data Mart from a Delta table – see Figures 6, 7 & 8 below:
In this case we are reading the data from a staging area in which we have recorded information about every execution of the ETL process, so that if the Data Mart load fails, we can replay the loads in the same order in which they took place in the staging area. We have also divided the process into two parts: one for the data updates, based on joins, and another for the insertions of new rows.
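The update-via-join plus insert split can be sketched as follows, filtering the staged rows by execution identifier so they can be replayed in order. The ExecutionId column and table names are invented for this sqlite3 sketch of the logic shown in the figures:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE StagingItems (ExecutionId INTEGER, ItemCode TEXT, Name TEXT);
CREATE TABLE DimItem (ItemCode TEXT PRIMARY KEY, Name TEXT);
""")
con.execute("INSERT INTO DimItem VALUES ('A1', 'Widget')")
con.executemany("INSERT INTO StagingItems VALUES (?,?,?)", [
    (1, "A1", "Widget v2"),  # update of an existing row
    (1, "B7", "Gizmo"),      # brand-new row
])

# Part 1: updates, resolved with a join against the dimension
con.execute("""
    UPDATE DimItem SET Name = (
        SELECT s.Name FROM StagingItems s
        WHERE s.ItemCode = DimItem.ItemCode AND s.ExecutionId = 1)
    WHERE ItemCode IN (
        SELECT ItemCode FROM StagingItems WHERE ExecutionId = 1)""")

# Part 2: insertions of rows not yet present in the dimension
con.execute("""
    INSERT INTO DimItem
    SELECT ItemCode, Name FROM StagingItems
    WHERE ExecutionId = 1
      AND ItemCode NOT IN (SELECT ItemCode FROM DimItem)""")

result = sorted(con.execute("SELECT ItemCode, Name FROM DimItem"))
print(result)  # [('A1', 'Widget v2'), ('B7', 'Gizmo')]
```

Running one ExecutionId at a time, in ascending order, reproduces the sequence of changes exactly as it was captured in the staging area.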
In this month’s column we have seen how to load data into a Data Mart through a staging area, using commonly used SSIS features. Best practices for loading data should be considered during the planning stage of the ETL process design. Watch out for my second article next month, the conclusion of this two-part series.
- Integration Services Key Feature in BI Projects (Part 1 of 2) - October 10, 2012