Tuesday, July 29, 2014

SSIS Tips: Extract and Populate Multiple Table Relational Data from Multipart File

I'm back with another scenario regarding text file data transfer.
The impetus for writing this blog was a recent mail I got from one of my blog followers. He had a scenario for which he wanted me to provide a solution.
The scenario was a multipart file containing the details of orders. The file has three different types of rows indicating the order header information, the details and the summary information. The metadata of the three row types is different, so it was not possible to use a standard data flow task with a flat file source to process and get the data from the file. Also, the file data has to go into three separate tables in the database, which have referential integrity enforced through foreign key constraints. As always with requirements of this type, they had little control over how the file was getting generated at the source.
I suggested a workaround for handling the scenario which I'm sharing here.
The file looked like this
As you see from the above, the file consists of three types of rows. The rows starting with H designate the Order Header information and consist of column values for the OrderNo, OrderDate and CustomerName columns. The rows starting with D indicate the details of the order, which include the ItemName, PartNo for the item, Quantity ordered, UnitPrice of the item and also the Shipset number (a Shipset represents a unit of delivery for the Order). Finally the rows with S represent the Order Summary information with the latest Delivery Date (of the last shipset) and the Total Order Value. So each group of rows from an H row till the following S row represents the details of a single order, and all of them have to be stored against the same reference ID value (OrderHeaderID).
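To make the layout concrete, a made-up extract following the same format would look like this (all values are illustrative only, not the actual sample):

H,1001,2014-07-20,ABC Corp
D,Widget,P-100,5,20.00,1
D,Gadget,P-200,2,35.50,2
S,2014-07-28,171.00
H,1002,2014-07-21,XYZ Ltd
D,Sprocket,P-300,10,12.00,1
S,2014-07-30,120.00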
The data from the above file has to go into the below tables in the database:
OrderHeader (OrderHeaderID (PK, IDENTITY), OrderNo, OrderDate, CustomerName)
OrderDetails (OrderHeaderID (FK to OrderHeader), ItemName, PartNo, Qty, UnitPrice, ShipsetNo)
OrderSummary (OrderHeaderID (FK to OrderHeader), LatestDeliveryDate, OrderTotal)
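A minimal DDL sketch for these tables, plus the staging table used by the package, might look like this (column data types and lengths are assumptions):

CREATE TABLE OrderHeader
(
OrderHeaderID int IDENTITY(1,1) PRIMARY KEY,
OrderNo varchar(20),
OrderDate datetime,
CustomerName varchar(100)
)

CREATE TABLE OrderDetails
(
OrderHeaderID int FOREIGN KEY REFERENCES OrderHeader(OrderHeaderID),
ItemName varchar(100),
PartNo varchar(50),
Qty int,
UnitPrice decimal(10,2),
ShipsetNo int
)

CREATE TABLE OrderSummary
(
OrderHeaderID int FOREIGN KEY REFERENCES OrderHeader(OrderHeaderID),
LatestDeliveryDate datetime,
OrderTotal decimal(12,2)
)

CREATE TABLE StagingTable
(
RowID int IDENTITY(1,1),
DataColumn varchar(2000),
Grp int
)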

The package for this looks like below

The steps involved are:
1. Execute SQL task for truncating and preparing the StagingTable for the data load. The staging table will have a RowID column with the IDENTITY property.
2. Data Flow Task to load the staging table with the file data, going into a single data column in comma separated format.
3. Execute SQL task to generate the group IDs that identify which header, detail and summary rows belong to each group. This is done in two steps: first the header rows are assigned a sequential value, and then that value is applied to the detail and summary rows that follow them.
The logic looks like below
-- Step 1: assign a sequential Grp value to each header (H) row
UPDATE t
SET Grp = Seq
FROM
(
SELECT *, ROW_NUMBER() OVER (ORDER BY RowID) AS Seq
FROM StagingTable
WHERE LEFT(DataColumn, CHARINDEX(',', DataColumn) - 1) = 'H'
) t

-- Step 2: propagate each header's Grp value to the detail and summary rows that follow it
UPDATE t
SET Grp = t1.Grp
FROM StagingTable t
CROSS APPLY (SELECT TOP 1 Grp
             FROM StagingTable
             WHERE Grp IS NOT NULL
             AND RowID < t.RowID
             ORDER BY RowID DESC) t1
WHERE t.Grp IS NULL

The logic orders the records on the basis of the RowID value, generates a sequence value using the ROW_NUMBER function and assigns it to the header (H) rows. This is followed by a further update statement which assigns the group ID to the corresponding detail and summary rows that come after each header row. So at the end of this step all rows belonging to a single order will have the same group ID value. This is required for applying the newly generated OrderHeaderID (IDENTITY value) from the OrderHeader table to each of the detail and summary rows of the order during the final data population.
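To illustrate, after the two updates the staging table would look something like this (RowID values are hypothetical):

RowID   Row type   Grp after 1st update   Grp after 2nd update
1       H          1                      1
2       D          NULL                   1
3       D          NULL                   1
4       S          NULL                   1
5       H          2                      2
6       D          NULL                   2
7       S          NULL                   2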
4. Execute SQL task to execute a procedure which does the final data load, with logic as below

CREATE PROC PopulateOrderTableData
AS

DECLARE @INSERTED_ORDERS table
(
OrderHeaderID int,
GrpID int
)

IF OBJECT_ID('tempdb..#Headers') IS NOT NULL
DROP TABLE #Headers
IF OBJECT_ID('tempdb..#Details') IS NOT NULL
DROP TABLE #Details
IF OBJECT_ID('tempdb..#Summary') IS NOT NULL
DROP TABLE #Summary

SELECT RowID,Grp,[1],[2],[3],[4]
INTO #Headers
FROM StagingTable s
CROSS APPLY dbo.ParseValues(s.DataColumn,',')f
PIVOT (MAX(Val) FOR ID IN ([1],[2],[3],[4]))p
WHERE [1] = 'H'

SELECT RowID,Grp,[1],[2],[3],[4],[5],[6]
INTO #Details
FROM StagingTable s
CROSS APPLY dbo.ParseValues(s.DataColumn,',')f
PIVOT (MAX(Val) FOR ID IN ([1],[2],[3],[4],[5],[6]))p
WHERE [1] = 'D'

SELECT RowID,Grp,[1],[2],[3]
INTO #Summary
FROM StagingTable s
CROSS APPLY dbo.ParseValues(s.DataColumn,',')f
PIVOT (MAX(Val) FOR ID IN ([1],[2],[3]))p
WHERE [1] = 'S'

-- Dummy MERGE (1=2 never matches) so the OUTPUT clause can capture the
-- generated IDENTITY value together with the source Grp value
MERGE INTO OrderHeader t
USING #Headers s
ON 1=2
WHEN NOT MATCHED THEN
INSERT (OrderNo, OrderDate, CustomerName)
VALUES (s.[2], s.[3], s.[4])
OUTPUT INSERTED.OrderHeaderID, s.Grp INTO @INSERTED_ORDERS;

INSERT OrderDetails
SELECT i.OrderHeaderID,[2],[3],[4],[5],[6] 
FROM #Details d
INNER JOIN @INSERTED_ORDERS i
ON i.GrpID = d.Grp

INSERT OrderSummary
SELECT i.OrderHeaderID,[2],[3] 
FROM #Summary d
INNER JOIN @INSERTED_ORDERS i
ON i.GrpID = d.Grp

The procedure creates three temporary tables #Headers, #Details and #Summary and populates them with the header, detail and summary rows of the file data. The logic makes use of the UDF ParseValues, explained here http://visakhm.blogspot.in/2010/02/parsing-delimited-string.html, to parse the comma separated data from the table field into separate values, which are then pivoted into multiple columns.
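For example, calling the UDF on a single header row (the order values are made up) returns one row per delimited value, which the PIVOT then spreads into columns:

SELECT ID, Val
FROM dbo.ParseValues('H,1001,2014-07-20,ABC Corp', ',')
-- ID   Val
-- 1    H
-- 2    1001
-- 3    2014-07-20
-- 4    ABC Corp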
Then it makes use of a dummy MERGE statement to do the insertion into the OrderHeader table and captures the generated OrderHeaderID values along with the corresponding Grp values into a table variable. The MERGE statement is necessary because we need to capture the Grp value from the source table along with the IDENTITY value using the OUTPUT clause, and this is possible only through a MERGE statement. Once this is done we populate the rest of the tables from the corresponding temporary tables, using a join to the table variable on the Grp value to get the generated identity value inserted into the tables.
At the end you can run a set of select statements against your tables and see the populated data as below

As seen from the above, the data got populated into our tables as expected.
This is a good approach you can use to extract data out of a multipart file and populate a set of related tables in the database while maintaining referential integrity.
The main point to remember here is the generation of a group ID value to tie the rows of the same order together, so that the generated OrderHeaderID gets assigned correctly to each of them.

The package and sample file used in the illustration can be downloaded from the below link

Package :

Sample file:

Feel free to get back for any comments/clarification you may have on the above

Wednesday, July 2, 2014

SSIS Tips: Handling Inconsistent Text Qualifiers in Flat File Source

Last week I had blogged about how to handle embedded text qualifiers in flat file source data in the below post.
http://visakhm.blogspot.in/2014/06/ssis-tips-handling-embedded-text.html
One of the pieces of feedback I got was from an ETL developer who asked whether it's possible to handle files with inconsistent text qualifiers using the flat file source in SSIS. He had also sent a sample file, which I've adapted here to illustrate the solution I provided for him.
The file looks like below
As you see, the file has inconsistent text qualifiers, i.e. for the 3rd column it has the combination "| as the text qualifier while for the other columns it's | alone. It also has " characters coming inside the data (see row 3).
In order to handle this, the basic approach would be the same as what we discussed previously. We will add a separate data flow task to handle the inconsistent text qualifiers and make them consistent throughout.
The first data flow looks like below
The flat file source here reads the entire row contents of the file as a single column. This is achieved by specifying the row delimiter as {CR}{LF} and leaving the column delimiter unspecified. The derived column task then makes the text qualifier consistent by doing a pattern replace using the REPLACE function. The expression used would be this

REPLACE(REPLACE(DataColumn,",\"|",",|"),"|\",","|,")
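To see the effect of this pattern replace, here is the equivalent in plain T-SQL against a made-up row (the values are assumptions, not the actual sample file):

DECLARE @Line varchar(200) = '1,|John|,"|Main Street, Apt 5|",|NY|'
-- Replace the ,"| and |", combinations with ,| and |, respectively
SELECT REPLACE(REPLACE(@Line, ',"|', ',|'), '|",', '|,') AS CleanedLine
-- Returns: 1,|John|,|Main Street, Apt 5|,|NY|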

This is then saved temporarily into a new flat file, which would have a structure as below
You can see from the above that the text qualifier is now consistent (|).
Now you can use a new data flow task with the above file as the source, choose the column delimiter and text qualifier as , and | respectively, and it will work fine, populating the data into your destination table.
This is an approach you can apply to extract data out of a flat file with inconsistent text qualifiers and get it loaded into your database table using SSIS.
The sample file and package for this can be found in the below link. 
Replace the connectionstrings with your actual values and you will be able to execute and test the package for yourself.
Feel free to revert for any further clarification. Happy coding!

Tuesday, July 1, 2014

T-SQL Tips: Retrieve Connectionstring Details from SSIS Package

I've published a new article on MSDN Wiki which explains how connectionstring details can be extracted from an existing SSIS package.
You can find the article in the below link

http://social.technet.microsoft.com/wiki/contents/articles/25150.t-sql-retrieve-connectionstring-details-from-a-ssis-package.aspx

This approach can be used when you want to audit existing SSIS packages and understand details like the connectionstrings they use, which SQLServer instances they try to connect to, etc. It comes in handy in maintenance projects where you inherit an existing system with little or no documentation.

Feel free to revert for any clarification

Monday, June 23, 2014

SSIS Tips: Handling Embedded Text Qualifiers in SSIS Flat File Source

Recently in one of my projects there was a case where I had to transfer data from a flat file to a SQLServer table. The file was generated by an automated process at the client side and was FTPed to our landing location. I created a simple package to do the data transfer with a single data flow task which looks like below.
On executing the package it broke with the below error
I checked the source file and it resembled the below structure (actual data replaced by sample for illustration)
On careful analysis I understood the issue. The reason was the presence of embedded text qualifier characters within the data, i.e. the Description value of "Descr3 "Descr4" desc5". Now the question was how to handle this and do the data transfer without any issues. In my case it was not possible to fix this at the source as the file generation step was automated. This post deals with a workaround which can be applied to get around the above issue.
I added a new data flow task before the current data flow. The purpose of this data flow task was to change the text qualifier character (in this case ") to some other character (here the | character) so that it doesn't get confused with the embedded " characters in the data. The data flow will have a structure as below


What this does is read the entire row contents of the file as a single column (just don't select anything for the column delimiter), then replace the text qualifier character from " to | and save the result as a new file.
The connection manager would be configured as below

Notice the row and column delimiter settings used
Now for the derived column task use an expression as below

SUBSTRING(REPLACE(REPLACE("," + DataColumn + ",","\",","|,"),",\"",",|"),2,LEN("," + DataColumn + ",") - 2)

What this does is replace the occurrences of the ," or ", patterns with ,| or |, which changes the text qualifier character from " to |. A comma (,) character is added to the beginning and end of the string to make the pattern consistent, and after the replacement we strip off the first and last characters we added using the SUBSTRING function.
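Here is the same transformation in plain T-SQL against a made-up row containing the embedded qualifier value mentioned above (the other values are assumptions):

DECLARE @Line varchar(200) = '3,"Item3","Descr3 "Descr4" desc5"'
-- Wrap with commas so every real qualifier sits next to a comma
DECLARE @Wrapped varchar(210) = ',' + @Line + ','
-- Replace ", and ," with |, and ,| then strip the wrapping commas
SELECT SUBSTRING(REPLACE(REPLACE(@Wrapped, '",', '|,'), ',"', ',|'), 2, LEN(@Wrapped) - 2) AS CleanedLine
-- Returns: 3,|Item3|,|Descr3 "Descr4" desc5|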
Once this is saved to the new file we will get the modified file structure as shown below.
Now in our main data flow point to the above saved file and configure the settings as below


Now link this to an OLEDB destination to populate your table and on executing you'll get the below data in the table

which clearly shows that the file data was properly transferred to the table.
This is a good workaround you can apply to deal with embedded qualifier characters coming in your data and get your data successfully transferred.
The package and sample file can be accessed from the below link
Package
https://drive.google.com/file/d/0B4ZDNhljf8tQQVRmTkJ4U01BLTA/edit?usp=sharing
Sample file
https://docs.google.com/spreadsheets/d/1HPIaz1VSR9BXfZo27KWxvq2qv12pzLNDRQHRTuCwx0U/edit?usp=sharing
Please revert if you want more clarification on the above and I will be happy to help. Also let me know any other comments you may have. Happy coding!

Monday, June 9, 2014

SSIS Tips: Implementing SCD Functionality Using Hashing Method

There are numerous ways in which you can implement slowly changing dimension processing in SSIS. There is also a standard task available in SSIS called the SCD Task for this. But that task becomes a performance bottleneck for large tables. So quite often people make use of alternatives like a combination of lookup tasks with a conditional split to do the SCD processing. But the problem with this approach is that it requires a comparison on each of the non-key columns to check whether the value has changed, which is cumbersome for large tables with a large number of attribute columns. A better alternative here is to apply a hashing function over the columns and use the generated value for comparison. This blog discusses two methods by which you can utilize hashing algorithms to implement SCD processing.

 Method 1 : Applying hashing in SSIS
--------------------------------------------
There is a third party task called Multiple Hash available on CodePlex (http://ssismhash.codeplex.com/) that can be utilized for this.
The task offers a set of algorithms like SHA1, MD5 etc which can be used for generating the hash values.
The package utilizing the Multiple Hash task would be as below


The source file will look like this

The requirement is to transfer the details from the source file to the dimension table. All attributes except ReservationType have to undergo Type 1 processing, and ReservationType has to be Type 2 processed.
The package workflow would be as below
1. We will have a flat file source that points to the source file to extract the data from it.
2. A Multiple Hash task is used next to generate two hash values, one based on the ReservationType column and the second based on the other non key columns in the table. The algorithm used in this example is SHA1 with safe null handling enabled. We utilize these two hash values for comparison and for doing the Type 2 / Type 1 processing of the associated columns.
The Multiple Hash task configuration for the two hash values look like below

3. There are three lookup tasks used one after the other.
The first lookup would be based on the primary key field(s), i.e. ID in the above case.
This is to understand whether the record is a new record or an existing one. The No Match output of this lookup would be linked to an OLEDB destination which will insert the new records.
In the case of Match output records, i.e. existing ones, we have three conditions:
a. Records with no changes in any of the non key values, i.e. the hash values being the same; we need to ignore these.
b. Records with a change in the ReservationType field value. These have to be Type 2 processed (add new entry + close existing entry). The table has two fields ValidFrom and ValidTo to indicate the validity of each record for historical (Type 2) representation. At any time only one record per key will have ValidTo as NULL, which stores the present values of the Type 2 attributes.
c. Records with changes to any of the other non key columns. These need to be Type 1 processed (simple update).

For this purpose we link the Match output of the primary key lookup task to the second lookup, which tries to match on the ReservationType hash value field. The No Match output of this is followed by two OLEDB Command steps: one to close down the existing record by updating the ValidTo field with the current date value, and a second one to insert a new record with the new ReservationType value, ValidFrom as the current date and ValidTo as NULL. The Match output then links to the third lookup task to see whether at least one of the other non key field values has changed. The No Match output of this is linked to an OLEDB Command to update the values of the other non key attributes which have changed. The Match output is ignored as we don't want to do any processing for the unchanged records.
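A rough T-SQL sketch of the Type 2 branch is shown below; the dimension table name DimReservation, the column lengths and the sample values are assumptions, and in the package these would be parameterized OLEDB Command statements fed from the lookup's No Match output:

-- Hypothetical key value and changed Type 2 attribute value
DECLARE @ID int = 101, @NewReservationType varchar(50) = 'Online'

-- Close the currently open record for the key
UPDATE DimReservation
SET ValidTo = GETDATE()
WHERE ID = @ID
AND ValidTo IS NULL

-- Insert a new record carrying the changed ReservationType value
INSERT INTO DimReservation (ID, ReservationType, ValidFrom, ValidTo)
VALUES (@ID, @NewReservationType, GETDATE(), NULL)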

Method 2 : Applying HASHBYTES T-SQL function
-------------------------------------------------------------------------

T-SQL also has a HASHBYTES function which can be utilized in the above scenario. For that, the first step would be to get the flat file data into a staging table. This can be done using a simple data flow task in SSIS. Once the data is in the staging table, you can use another data flow task where you do the same set of lookups as in the above case to compare hash values and identify the column value changes.
The second data flow task would look almost the same as the above, except for the Multiple Hash step.

We need to be careful about a couple of things while using the HASHBYTES function

1. Make sure we handle NULLs appropriately. The presence of a NULL value in one of the columns will cause the whole concatenated value to become NULL, so we need to make sure we use ISNULL or COALESCE to convert NULLs to blank values

2. Another thing to keep in mind is to separate the column values by introducing a delimiter in between while concatenating. This is needed because there's a possibility of the concatenated value being the same even though the individual column values are not

See this as an illustration

DECLARE @Col1 varchar(100)= 'Test',@Col2 varchar(100) = 'Code',@col3 varchar(100)='In Progress'
DECLARE @ColA varchar(100)= 'Test Code',@ColB varchar(100) = 'In',@colC varchar(100)='Progress'

-- Concatenated with only a space, both sets produce 'Test Code In Progress'
SELECT HASHBYTES('SHA1',@Col1+ ' ' + @Col2+ ' ' + @col3),HASHBYTES('SHA1',@ColA+ ' ' + @ColB+ ' ' + @colC)

You can see that the two hashed values match even though the column values are not the same


Now add a delimiter in between and see the result

DECLARE @Col1 varchar(100)= 'Test',@Col2 varchar(100) = 'Code',@col3 varchar(100)='In Progress'
DECLARE @ColA varchar(100)= 'Test Code',@ColB varchar(100) = 'In',@colC varchar(100)='Progress'

-- With a comma delimiter, the concatenated values (and hence the hashes) differ
SELECT HASHBYTES('SHA1',@Col1+ ', ' + @Col2+ ', ' + @col3),HASHBYTES('SHA1',@ColA+ ', ' + @ColB+ ', ' + @colC)



Adding the delimiter ensures the hash value is derived from the individual column values rather than just the concatenated pattern, so different column combinations won't collide in this way.
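Putting both points together, the two hash values for Method 2 could be computed in the staging query along these lines; the staging table name and the non key columns other than ReservationType are assumptions for illustration:

SELECT ID,
       -- hash over the Type 2 attribute, NULL-safe
       HASHBYTES('SHA1', ISNULL(ReservationType, '')) AS ReservationTypeHash,
       -- hash over the remaining non key attributes, NULL-safe and delimited
       HASHBYTES('SHA1', ISNULL(CustomerName, '') + '|'
                       + ISNULL(RoomType, '') + '|'
                       + ISNULL(CONVERT(varchar(10), CheckInDate, 120), '')) AS NonKeyHash
FROM StagingReservation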

You may use any of the above methods to apply a hashing algorithm for SCD processing in SSIS. The choice of method depends on whether it's feasible to use a third party component, whether a staging environment is available, etc.
Feel free to revert for any clarification. Enjoy coding!