Microsoft Excel is a very popular tool among users, so it’s common to find Excel files when implementing extraction, transformation, and loading (ETL) processes. After I explain the reasons why users create Excel files and where you might find them, I’ll talk about some common problems that you might encounter when dealing with them in ETL processes, such as problems with drivers, hidden data, data types, certain transformations, separators, and worksheet variants. I’ll also present the recommended solutions for these problems.
Why Excel Files Are Created and Where to Find Them
Excel is a commonly used tool in companies for several reasons:
Users feel very comfortable using Excel to manage and visualize data.
Users often export data (e.g., query results, reports) from other applications to Excel workbooks because many applications provide that option. So, Excel workbooks are used for information storage, as Figure 1 shows.
When other applications don’t offer users a way to manage data, users turn to Excel.
For example, suppose an HR manager wants to classify store employees by the store in which they work, but the application used by the HR manager doesn’t have a column to note the store classification. The HR manager might create an Excel workbook, list all the employees, then specify the store in which each employee works. Instead of asking the IT department to create an application to manage some data, users might create Excel workbooks. For example, in order to track and compare the rates of suppliers, a purchasing clerk might create a new Excel workbook that lists the suppliers, the products they offer, and the prices of those products.
Users might use Excel to connect to several relational and multidimensional data sources in order to query them, then store the query results in Excel workbooks.
Figure 1: Excel as information storage
As you can see, Excel complements the existing applications in companies in many ways. Therefore, Excel becomes both a source of data and a destination for data, as Figure 2 shows.
Figure 2: Excel as a source and a destination
When dealing with the execution of ETL tasks, you usually find Excel files between sources and destinations. So, I recommend that whenever you deal with a business intelligence (BI) project, you review in detail all the Excel files in the company. It’s common to find Excel files that contain important business information that isn’t stored in any other application. If you find some, you must deal with them in your ETL processes.
Sometimes we find Excel files just because the user feels comfortable with this tool, and it doesn’t really matter the format in which they’re stored because of their simplicity. For example, in a case like the previous one, in which the user creates in Excel a table with codes and descriptions of provinces. In such cases, and given that Excel works without any problem and in a clear way for the user with the format .CSV (Comma Separated Values), we must consider if it’s worth avoiding problems and restrictions of Excel and just storing the information in .CSV format. We just have to talk to the user in order to establish the storage format. Let’s weigh up the pros and cons of the Excel format in comparison to the .CSV format.
Given the frequency in which important business information is found in Excel files, let’s look at the typical problems that you’ll encounter when you use Excel files as a source when working with SQL Server Integration Services (SSIS) packages in ETL processes.
Problem: 32-Bit Drivers in a 64-Bit Environment
Nowadays, most processors are 64 bits, but there still is a lack of 64-bit drivers (i.e., data providers) for some programs, including Excel. The Microsoft OLE DB Provider for Jet, which is used to connect to Excel spreadsheets, doesn’t have a 64-bit version. Recommended Solution for Dealing with 32-Bit DriversIf you have a 64-bit server and you need to execute an SSIS package that connects to an Excel data source, you can run that package in 32-bit mode. To do so, follow these steps:
When designing the package, you need to set the Run64BitRuntime property to False in the project’s Property Pages in Business Intelligence Development Studio (BIDS), as shown in Figure 3. Note that this setting will affect the execution of any other packages involved in that project.
Figure 3: Configuration of 32-bit execution in BIDS
When executing the SSIS package in SQL Server Agent, you need to create a job that executes the SSIS package in 32-bit mode. In the step that executes the package, go to the Execution Options tab and select the Use 32 bit runtime check box, as Figure 4 shows.
Figure 4: Configuration of 32-bit execution in SQL Server Agent
Problem: Hidden Data
In Excel, users have the ability to hide worksheets in a workbook. When you use an Excel workbook as a data source in an SSIS package, SSIS won’t have access to any worksheets marked as hidden. As the opposite, hidden columns will be available to our SSIS packages, so no action is needed in that situation.
Recommended Solution for Dealing with Hidden Data
Before you run the SSIS package, verify that none of the worksheets being accessed are hidden. Take into account that the hidden columns will be accessible, and it won’t be necessary to make them visible to use them in our processes.
Problem: Incorrect Data Types
Excel is a tool directed at calculation. Its management of data types is very unsophisticated compared with the data type management in relational database managers, ETL tools, and SSIS tools. As a result, when you use an Excel file as data source in SSIS, you might encounter incorrect data types. To understand this problem, it’s helpful to know how data types are assigned when you use an Excel file as a data source in SSIS.
However, since in Excel we don’t have the possibility of indicating the data types, we just write in cells and we can’t define them in any place. So, how are the data types assigned when SSIS reads from an Excel source? There is an automatic assignment process. The first eight rows in a column are read, and depending on the values in those rows, one of four data types is assigned to that column:
DT_R8—numerical data without further distinctions regarding type (e.g., integer, currency value) or size (e.g., maximum or minimum size in bytes)
DT_DATE—dates and times
DT_WSTR—strings up to 4,000 characters
DT_NTEXT—strings longer than 4,000 characters
I haven’t found any official documentation specifying the assignment criteria, but based on my experience, here is what I believe happens:
If all the values are numbers, DT_R8 is assigned.
If all the values are dates and/or times, DT_DATE is assigned.
If there’s a string with up to 255 characters, DT_WSTR is assigned.
If there’s a string with more than 255 characters, DT_NTEXT is assigned.
The main problem with the automatic assignment process is the treatment of strings. If you have, for example, a 600-character value in a scanned row, the data type of DT_NTEXT is assigned even though DT_WSTR could have been assigned because it supports up to 4,000 characters.
Conversely, DT_WSTR might be assigned to a column that needs DT_NEXT. For example, DT_WSTR might be assigned to a column that includes 300-character strings because the first eight rows had only 100-character strings. In this case, truncations will occur, resulting in a loss of information and possible errors in ETL processes.
Although you can’t change the fact that the columns’ data types are automatically assigned, you can modify how many rows will be scanned when assigning them. That is the purpose of the TypeGuessRows entry in the HKEY_LOCAL_MACHINESOFTWAREMicrosoftJet4.0EnginesExcel registry key. By default, its value is 8, as Figure 5 shows. That’s why the first eight rows are read. You can change this value to the number of rows that you want scanned. If you specify the value of 0, all the rows in the file will be read. By scanning more rows, you increase the chance that the correct data type will be assigned. However, the higher the value (or when it’s 0), the higher the impact on performance.
Figure 5: Registry entry that specifies the number of rows to be read
Recommended Solution for Preventing Incorrect Data Types
To prevent incorrect data types, I recommend implementing the following methods in the order shown:
Create a sample Excel file to help the automatic assignment process assign the correct data types.
Make any necessary changes in the data type definitions by adding OLE DB Source and Excel Source components to the Data Flow task.
Use Data Conversion and Derived Column transformations to make any other changes needed to obtain the correct data types.
Let’s take a closer look at these three methods.
Creating a sample Excel file. If the rows scanned by the automatic assignment process don’t cover all the typical cases, the correct data type might not be assigned. Although you can change the TypeGuessRows entry so that most or all of the rows are scanned, the package’s performance will suffer. Thus, I recommend that you create a new Excel file and paste the representative values in the first row, so that the correct data type is assigned. If you want to include headings in the Excel file, you can indicate whether the first row contains data or headings in SSIS. By marking the first row as headings, the values will be assigned as columns names and they won’t be considered as a row in the automatic assignment process.
For example, suppose that you periodically receive an Excel file with information about new customers. Two of its columns (“ObservacionesMedicas” and “ObservacionesPsiquicas”) must have the data type of DT_NTEXT. By creating a sample file like that in Figure 6, you can increase the likelihood that those two columns as well as the other columns have the correct data type.
Figure 6: Sample file covering all the typical cases
Making any necessary changes in the data type definitions. If you create the sample Excel file and you are still getting incorrect data types, you can add the OLE DB Source component and Excel Source component to your Data Flow task. These components let you modify the input and output properties, including the data type properties. After adding each component, you need to go to the Input and Output Properties tab in the Advanced Editor. In the Data Type Properties section, you can change the data type. For example, you can change a DT_WSTR data type to DT_NTEXT or vice versa. However, there are some changes you can’t make, such as changing a Unicode string to a non-Unicode string. In addition, although you can change between the DT_WSTR and DT_NTEXT data types in Source Output, you’ll receive an error message when you try to change that column’s data type in Source Error Output, as Figure 7 shows.
Figure 7: Error after changing the DataType property in Source Error Output
Using Data Conversion and Derived Column transformations. If you still have some incorrect data types, you can use Data Conversion and Derived Column transformations to obtain the data types needed. For example, you could use the Data Conversion transformation to change DT_R8 to integer data types (e.g., DT_I1, DT_I2, DT_I4).
Problem: Transformation Restriction
One of the restrictions in Lookup and Fuzzy Lookup transformations is that you can’t join on a column that has the data type of DT_R8. This restriction can cause problems when some of the data used in a Lookup or Fuzzy Lookup transformation originates from an Excel file. For example, suppose that you have a customer table that has a column containing numerical province codes. When the name of the province needs to be used, you want this SQL Server table to read an Excel file that lists the province codes and the corresponding province names. Typically, you could use a Lookup transformation in which the province code column is the lookup column. However, because the automatic assignment process assigns the data type of DT_R8 to numerical columns in Excel worksheets, the province code column can’t be used as the lookup column in the Lookup transformation.
Recommended Solution for Dealing with the Transformation Restriction
To work around this transformation restriction, I recommend importing the Excel data into a SQL Server table and using that table in the Lookup or Fuzzy Lookup transformation. Lookup tables usually have only a few rows, so they don’t take long to import or take up too much space in the database.
Problem: Different Separators
Another problem inherent to using Excel files as a data source concerns the assignment of the decimal separator and the thousands separator in numbers. By default, Excel uses the system-defined decimal and thousands separators, which can vary depending on what region you’re in. For example, the period (.) is used as the thousands separator in Spain, whereas the comma (,) is used as the decimal separator in the United States. This might introduce problems if an SSIS package is joining data from Excel files originating from different parts of the world.
Recommended Solution for Dealing with Different Separators
To avoid problems associated with having Excel files that use different decimal and thousands separators, you can override the system-defined separators by setting custom separators in Excel. As Figure 8 shows, you clear the Use system separators check box, then enter the separators you want to use.
Figure 8: Configuration of custom separators
Problem: Worksheet Variants
When an SSIS package is created to periodically read the same Excel files, it is possible to have workbooks in which the number of worksheets changes. For example, an SSIS package might regularly read a workbook in which users add a worksheet each month showing the sales data for that month. Depending on which month it is, the workbook will have a different number of worksheets. It is also possible to have workbooks in which the names of the worksheets change. For example, a workbook containing product information might have one worksheet for each product and use the product’s name as the worksheet’s name. If the product’s name changes, the worksheet’s name will also change. In addition, a worksheet might be added if a new product is introduced or a worksheet might be deleted if a product was discontinued.
Recommended Solution for Dealing with Worksheet Variants
To handle changes in worksheet numbers and names without having to modify the SSIS package, you can use a Script task that contains Visual Basic .NET or C# code.
Broadly speaking, the process consists of two nested loops: one loop reads every Excel workbook in a specified folder and another loop inside the first loop reads every worksheet of each workbook, as Figure 9 shows. The last step is the importation of data from each worksheet to the destination.
Figure 9: Process to handle changes in worksheet names and numbers
What Users Would Like vs. What Is Best for ETL Processses
Now let’s discuss how to deal with a complexity that arises not from a technical issue but rather from different viewpoints between users and the IT team. In some situations, the users and IT team must use same Excel file. This is a problem because the optimal format for ETL processes differs from the optimal format for visualizing and updating data.
For example, suppose that a daily sales report is not only viewed by several users but also imported to the company’s data mart. How would users like to see the report formatted? As Figure 10 shows, they’d like it formatted in a way that makes the data easy to read and analyze—that is, without redundant data, with totals, with descriptive names (and not only codes), and with lines.
Figure 10: Ideal sales report form the user’s perspective
What is the optimal format for ETL processes? In this case, you don’t need to include the payment method (D 13 cell), the customers’ names (C column), or the items they bought (E column) because that information is in another table. You don’t need the totals because you’re going to calculate them elsewhere, but you do need a date in each row rather than having a date presented once at the top. Figure 11 shows what this report would look like when formatted for ETL processes.
Figure 11: Ideal sales report from the ETL process perspective
As Figure 12 shows, both the users and the IT department must come to an agreement on a format that is suitable for both parties and that doesn’t complicate the other party’s work. One way to do this is to present users with the format needed in the destination in SSIS. If there are some design elements that the users want that are reasonable, you can incorporate them into the design. For example, in this scenario, it’s reasonable to include the totals. After reaching an agreement, have the users make the necessary changes to the Excel file’s design.
Figure 12: Users and the IT team must reach an agreement on the format of the Excel file
Prevention of Design Changes
Excel files can be created and designed by a user, by an application with no user involvement, or by a combination thereof (e.g., created and designed by an application, then redesigned by a user). When users are able to change the design of a Excel file, you must try to prevent them from making changes that will result in files of the same type having different formats. For example, the customer code can’t be in column B in one file and in column D in another file. Although this design difference probably won’t matter for the user, it’s a very serious problem for the ETL process because it will cause an error during execution.
What solution can you use to prevent users from making a design change that will introduce an error? From a technical point of view, none. So, you must talk with the users and make them understand that the format must not be changed and that they will be held responsible if an error occurs because they made a change. This seems easy, but in reality, most users don’t accept the responsibility for these errors and try to blame the IT department instead. Therefore, you have to do your best to make it clear that the design must not be changed. You might even want to put this in writing.
As you’ve seen, there are many issues to keep in mind when you use Excel files in SSIS packages. So, here is a list of best practices that you can review every time you work with Excel files in an ETL process: Analyze the Excel data in detail to identify problems with data types. To solve those problems, implement the following methods in the order shown:
Create a sample Excel file that contains all the typical cases to help the automatic assignment process assign the correct data types.
Make any necessary changes in the data type definitions by adding OLE DB Source and Excel Source components to the Data Flow task, then using the Advanced Editor to make changes.
- Use Data Conversion and Derived Column transformations to make any other changes needed to obtain the correct data types.
Check for other types of problems:
If the Microsoft OLE DB Provider for Jet is installed on a 64-bit server, make sure that the Run64BitRuntime property in BIDS set to False when you design the SSIS package and the Use 32 bit runtime check box is selected when you execute it in SQL Server Agent.
Make sure that none of the worksheets being accessed are hidden.
If your SSIS package includes a Lookup or Fuzzy Lookup transformation that uses Excel data, make sure that the lookup column isn’t numerical. If it is, import the Excel data into a SQL Server table.
If your SSIS package includes Excel files created in different parts of the world, make sure that they all use the same decimal and thousands separators.
If your SSIS package regularly reads the same Excel files, use a Script task to handle changes to worksheet numbers and names.
Talk with users:
Both the users and the IT department must come to an agreement on a worksheet format that is suitable for both parties and that doesn’t complicate the other party’s work.
Users need to know that the worksheet format must not be changed and that they will be held responsible if an error occurs because they made a change.
- Integration Services Key Feature in BI Projects (Part 1 of 2) - October 10, 2012
- Using Excel in ETL Processes - September 11, 2011