Tuesday, January 29, 2013

String pattern search - XML based method

Its a common scenario these days to search for a string value within a delimited list. Though its not a good design to store delimited values inside a table field, I've come across numerous cases where I've inherited systems with table fields like this. This blog explains you an approach to achieve this using XML functions available in T-SQL.
Consider the following illustration with tables table1 and table2


declare @t table
(
id int,
field varchar(20)
)

insert @t
values(1,'test'),
(2,'giu'),
(3,'piece'),
(4,'csv'),
(5,'master'),
(7,'pattern')

declare @t1 table
(
id int,
list varchar(1000)
)

insert @t1
values (1,'this,is,test,for,csv,search,piece'),
(2,'finding,string,pattern,within,given,csv,value')


Now the attempt is to search for words in text field of @t against csv value in @t1 and return matching details.
The conventional way of doing this by making use CSV split UDF and then using join logic to get matching words. The CSV spilt function would be like below


The same effect can be obtained by making a XML out of the CSV content and then using XML functions to parse and compare the data. The solution can be given as below




Now lets see the output
As you see from above the output does searching for pattern with csv as expected.
Now lets analyze the above solution. The first step is to convert the csv string onto a valid XML value. This is for taking advantage of  xquery functions for the string search. The xml is formed by adding fake nodes to string by means of replace function to get a xml string like below



Once this is done, then xquery function exist() is used to search for the field value within each of the xml nodes. This gives the same effect as searching for field pattern from @t table within the csv column list of  @t1. This is the same logic we used for xml based parsing of delimited string as outlined here
http://visakhm.blogspot.in/2013/01/delimited-string-split-xml-parsing.html
The table columns can be passed as an argument within xquery by using sql:column() function. the exist() function will give 1 only in cases where passed column value matches a node value within the created xml. This filter condition would ensure we get only matched string values in the output for the csv lists.
EDIT:
This can be extended for pattern searching using contains as well
For that ,the above solution needs to be tweaked as


declare @t table
(
id int,
field varchar(20)
)

insert @t
values(1,'test'),
(2,'giu'),
(3,'pieceofmeal'),
(4,'csvtext'),
(5,'master'),
(7,'pattern')

declare @t1 table
(
id int,
list varchar(1000)
)

insert @t1
values (1,'this,is,test,for,csv,search,piece'),
(2,'finding,string,pattern,within,given,csv,value')



The output will be as shown



This is a cool approach which can be used for string value search within csv. For moderate dataset this approach performed much better than conventional string splitting UDF approach for me. I'll be putting a follow up post with my comparison results for the various string searching methods soon.

No comments:

Post a Comment