snaps & snippets: February 2007

We've discussed the basics of XML in SQL, we've seen a few examples of XQueries and their performance; I guess we're ready to go a bit deeper. In this article we'll look more closely at XPath expressions and see how they behave in SQL Server 2005 XML Queries.

More XPaths than one lead to <Rome/>

First of all there are several ways of formulating a valid XPath expression – for a comprehensive overview of the subject please consult the W3C Recommendations on XPath (mentioned in previous posts) and other sources (e.g. the XPath Tutorial at W3Schools Online Web Tutorials).

Generally, we have two options ^{(with examples later in this article)}:

The exact path – where either the full path (starting at the root node) to a particular node (be it an element or an attribute) is specified (a.k.a. the absolute path) or the partial path to a particular node is specified depending on the current context (a.k.a. the relative path); or
The approximate path – where the path to the node is specified using XPath wildcards (e.g. "//" – anywhere, "element::*" or "*" – any element, "attribute::*" or "@*" – any attribute;¹ of course more on the subject can be found in the W3C Recommendations). Although wildcards can also be used as part of the absolute or the relative path, such XPath expressions cannot be considered exact.

Specifying the exact path is generally the fastest, but it requires quite an intimate knowledge of the XML in question, and may require existing XPath expressions (e.g. used in T-SQL modules) to be corrected if the XML structure is subsequently modified for whatever reason. Approximate paths, on the other hand, are less structure-dependent, but they suffer from a higher performance cost, especially in complex structures.

Of course neither of the two options may lead to the desired node if more nodes exist in a given XML that correspond to a particular XPath expression. It is therefore necessary to extend the expression in such cases with an appropriate XPath predicate.

If we were to compare XPath/XQuery expressions with T-SQL queries we could say that the XPath expression corresponds to the T-SQL SELECT and FROM clauses and the XPath predicate functions as the T-SQL WHERE clause. An appropriate XPath predicate is practically indispensable in situations where the XPath expression should point to a specific node – e.g. if a singleton is needed (as is the case with the value() XML method; we've seen such a case before).

For instance, using the following XML:

declare @example xml

set @example = N'<document>
 <title>Hamsters prefer pumpkin seeds</title>
 <paragraph>George B. Quick is a golden hamster.</paragraph>
 <paragraph>He''s quick when curious about his surroundings and even quicker
  when hungry.</paragraph>
 <paragraph>Although he''s presented with a wide variety of tasty and
  nutritious goodies each day, George prefers shiny pumpkin seeds.
  </paragraph>
</document>'

...all the following queries return the value of the title element using three different XPath expressions:

-- Exact (absolute) path
select @example.query(
   '/document/title'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Title

-- Approximate path using the "anywhere" wildcard
select @example.query(
   '//title'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Title

-- Approximate path using the "any element" wildcard
select @example.query(
   '/*/title'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Title

No predicates are necessary in this particular example as there is only a single title element in our XML. However, if we wanted to extract the value of a particular paragraph element we might end up with unexpected results:

select @example.query(
   '/document/paragraph'
    ).value(
     '.'
     ,'varchar(max)'
     ) as Paragraph

select @example.query(
   '//paragraph'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Paragraph

select @example.query(
   '/*/paragraph'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Paragraph

All three queries return a concatenation of values from all paragraph elements. While this may be valid for some it may not be valid for all queries. What can we do? Well, with an appropriate predicate we can restrict the XML Query result. For instance, we can qualify the XPath expression using the paragraph element's position:

-- Extract the value of the second paragraph...

-- ...using the full syntax for the predicate:
select @example.query(
   '//paragraph[position() = 2]'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Paragraph

-- ...using shorthand:
select @example.query(
   '//paragraph[2]'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Paragraph

Although this method is frequently used in printed and online references on the subject of querying XML data (including Books Online!) I strongly advice against it, as it serves as a doorway to a whole world of possible problems. According to the XML specification the original order of elements need not be preserved (if several elements of the same name are allowed in a given XML) – meaning that it is perfectly legal to change element positions, to insert additional elements anywhere inside the existing element island(s), etc. Even XML Schema won't prevent this.

Being presented with an unfortunate XML as the one in our example there is little we can do in XML queries to prevent unexpected or incorrect results due to changes in the XML. If, on the other hand, the XML structure includes a means of uniquely identifying individual nodes, we can use more resilient XPath expressions:

declare @example xml

set @example = N'<document>
 <title>Hamsters prefer pumpkin seeds</title>
 <paragraph id="1">George B. Quick is a golden hamster.</paragraph>
 <paragraph id="2">He''s quick when curious about his surroundings and even
  quicker when hungry.</paragraph>
 <paragraph id="3">Although he''s presented with a wide variety of tasty and
  nutritious goodies each day, George prefers shiny pumpkin seeds.
  </paragraph>
</document>'

-- Exact (absolute) path using the appropriate predicate
select @example.query(
   '/document/paragraph[@id = 2]'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Paragraph

-- Approximate path using the "anywhere" wildcard and the appropriate predicate
select @example.query(
   '//paragraph[@id = 2]'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Paragraph

-- Approximate path using the "any element" wildcard and the appropriate predicate
select @example.query(
   '//*[@id = 2]'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Paragraph

In the example above we've modified the XML by adding attributes that uniquely identify each repeated element – this way we can use XPath expressions that make the query no longer depend on element positions, but rather on more reliable properties. Of course this is not the only method that can be used nor may it be the most appropriate in all situations. We can quite clearly see that data integrity does start in the XML, and typing the XML (e.g. with XML Schema) is only one of the many steps we have to take to maintain it.

Undoubtedly, data integrity is important, but we should not neglect the performance of data retrieval.

Which XPath is the quickest?

In our performance test² we will focus on the use of XML retrieval methods in query conditions – i.e. a comparison of various XPath expressions in the WHERE clause and their effect on performance. The queries used in this test are listed in footnotes at the bottom of this article. Please, refer to Part Two of the series for DDL and sample data. Note that both tables are XML-indexed and appropriate XML schema collections are used to type XML columns.

Performance ^{(All XML indexes)}
Query	EAV Schema		Query	Typed Schema
Query	CPU	Reads	Query	CPU	Reads
#1	29260	1180945	#7	11837	403710
#1	25847	1181142	#7	10846	403706
#1	28161	1180896	#7	11206	403706
#1	30845	1180896	#7	12378	403706
#1	31225	1180896	#7	12839	403724
#1	30224	1180896	#7	11917	403706
#2	29929	1180896	#8	13175	456104
#2	28351	1180896	#8	13680	456104
#2	26628	1180896	#8	12087	456104
#2	31945	1180896	#8	13439	456104
#2	31946	1180896	#8	13519	456104
#2	30774	1180896	#8	13149	456104
#3	28148	884298	#9	18164	764994
#3	27369	884298	#9	17996	764992
#3	27560	884298	#9	18397	765004
#3	27880	884298	#9	17876	764992
#3	29873	884298	#9	18807	764992
#3	28060	884298	#9	17746	764992

As we've already seen in previous articles the performance of XML queries in comparison to T-SQL queries is quite poor. However, it should be apparent now that we can actually improve the performance of XML retrieval simply by using more appropriate XPath expressions.

For the EAV XML type approximate XPath expressions seem to be less resource-intensive, while for the strongly typed XML using an exact path will yield better performance. Of course, once again it seems that SQL Server prefers typed XML.

Anyway, we're not quite done yet. Believe it or not, there still remains some room for improvement. And it won't require any changes to the XML or the XML Schema – the XPath expressions can be further improved. How about halving CPU time and decimating reads?

Performance ^{(All XML indexes)}
Query	EAV Schema		Query	Typed Schema
Query	CPU	Reads	Query	CPU	Reads
#4	8995	148077	#10	5670	87871
#4	8922	148077	#10	5648	87871
#4	8823	148077	#10	5879	87871
#4	9293	148077	#10	5678	87871
#4	8893	148077	#10	5487	87871
#4	9043	148077	#10	5658	87871
#5	9035	148077	#11	5740	87871
#5	8993	148077	#11	5598	87871
#5	9073	148077	#11	5668	87871
#5	8893	148077	#11	5778	87871
#5	9143	148077	#11	5749	87871
#5	9073	148077	#11	5909	87871
#6	12332	228022	#12	10497	242122
#6	12178	228031	#12	10115	242122
#6	12087	228019	#12	10084	242122
#6	12388	228023	#12	10726	242122
#6	12518	228019	#12	10835	242122
#6	12488	228019	#12	10725	242122

By using even more exact XPath expressions we can reduce CPU time by 2 to 3 times and reduce the number of reads by 3 to 8 times! This *is* without a doubt a significant amount.

But where's the catch?

Well, where does the value of a simple (or mixed) XML element actually reside? In its text node, that's where. It is therefore logical to look for it right there and nowhere else, and apparently this is exactly how XML data retrieval works in SQL: by specifying an XPath expression that is as exact as possible we allow the database engine to navigate past other possible locations (e.g. child nodes) and go directly to the intended target (the bull's-eye).

Even though our target element holds no descendants (and – according to XML schema collections being used – isn't allowed to hold any) the "/attributeCollection/firstName" XPath expression actually points to the entire element (i.e. a node-set, not just a node!), while the "/attributeCollection/firstName/text()" XPath expression points to a single final node. Well, this is what I've observed – after all, I haven't designed the darn thing. ;) It makes perfect sense, though.

Mental aggregation

What have we learned?

First of all, serious XML use starts with serious XML modeling: most of the considerations related to an efficient SQL data model are necessary for efficient XML models as well. Use the XML Schema to provide domain integrity (XDR, although supported by SQL Server 2005, is fast becoming obsolete), use appropriate relational modeling (yes, relational!) inside your XML Schema, and consider all relevant aspects of XML use *before* you introduce your XML model into your SQL model;
Second, know your XML: XML Queries should reflect the expected structure of XML data – consider the XML Schema as your guide through the XML. Can you afford to use incorrect data?
Third, really know your XML: the difference between exact and approximate XPath expressions may be insignificant on small quantities of data, but may grow dramatically on large quantities. SQL Server prefers the exact XPath and so should you;
And last but not least, if your particular business case requires an XML model that can also be modeled in SQL, use SQL instead; especially so if a SQL pattern or practice that matches your requirements already exists. After all, the superiority of SQL over XML regarding data modeling, storage and retrieval remains undisputed.

Next time we'll take a look at something rather unexpected and quite useful (in some respects).

¹ Quotation marks are not part of the XPath expression.
² You'll need this for the rest of the queries to work (in addition to DDL and sample data available here):
```
declare @firstName varchar(8)
set @firstName = 'Karen'
```

Query #1 (EAV Type, absolute path):

select ANV.AttributeId as AttributeId
 from dbo.AttributesNameValue ANV
 where (ANV.Attribute.query(
    '/attributeCollection/attribute[name = "firstName"]/value'
    ).value(
     '.'
     ,'varchar(max)'
     ) = @firstName)

Query #2 (EAV Type, approximate path):

select ANV.AttributeId as AttributeId
 from dbo.AttributesNameValue ANV
 where (ANV.Attribute.query(
    '//attribute[name = "firstName"]/value'
    ).value(
     '.'
     ,'varchar(max)'
     ) = @firstName)

Query #3 (EAV Type, an "even more approximate" path):

select ANV.AttributeId as AttributeId
 from dbo.AttributesNameValue ANV
 where (ANV.Attribute.query(
    '//value[../name = "firstName"]'
    ).value(
     '.'
     ,'varchar(max)'
     ) = @firstName)

Query #4 (EAV Type, the "bull's-eye" path):

select ANV.AttributeId as AttributeId
 from dbo.AttributesNameValue ANV
 where (ANV.Attribute.query(
    '/attributeCollection/attribute[name/text() = "firstName"]/value/text()'
    ).value(
     '.'
     ,'varchar(max)'
     ) = @firstName)

Query #5 (EAV Type, the "near bull's-eye" path):

select ANV.AttributeId as AttributeId
 from dbo.AttributesNameValue ANV
 where (ANV.Attribute.query(
    '//attribute[name/text() = "firstName"]/value/text()'
    ).value(
     '.'
     ,'varchar(max)'
     ) = @firstName)

Query #6 (EAV Type, the "in the neighbourhood of the bull's-eye" path):

select ANV.AttributeId as AttributeId
 from dbo.AttributesNameValue ANV
 where (ANV.Attribute.query(
    '//value[../name/text() = "firstName"]/text()'
    ).value(
     '.'
     ,'varchar(max)'
     ) = @firstName)

Query #7 (Strong Type, absolute path):

select AT.AttributeId as AttributeId
 from dbo.AttributesTyped AT
 where (AT.Attribute.query(
    '/attributeCollection/firstName'
    ).value(
     '.'
     ,'varchar(max)'
     ) = @firstName)

Query #8 (Strong Type, approximate path):

select AT.AttributeId as AttributeId
 from dbo.AttributesTyped AT
 where (AT.Attribute.query(
    '//firstName'
    ).value(
     '.'
     ,'varchar(max)'
     ) = @firstName)

Query #9 (Strong Path, the "ridiculously complicated" path):

select AT.AttributeId as AttributeId
 from dbo.AttributesTyped AT
 where (AT.Attribute.query(
    '//*[local-name() = "firstName"]'
    ).value(
     '.'
     ,'varchar(max)'
     ) = @firstName)

Query #10 (Strong Type, the "bulls's-eye" path):

select AT.AttributeId as AttributeId
 from dbo.AttributesTyped AT
 where (AT.Attribute.query(
    '/attributeCollection/firstName/text()'
    ).value(
     '.'
     ,'varchar(max)'
     ) = @firstName)

Query #11 (Strong Type, the "near bull's-eye" path):

select AT.AttributeId as AttributeId
 from dbo.AttributesTyped AT
 where (AT.Attribute.query(
    '//firstName/text()'
    ).value(
     '.'
     ,'varchar(max)'
     ) = @firstName)

Query #12 (Strong Type, the "grasping at straws" path):

select AT.AttributeId as AttributeId
 from dbo.AttributesTyped AT
 where (AT.Attribute.query(
    '//*[local-name() = "firstName"]/text()'
    ).value(
     '.'
     ,'varchar(max)'
     ) = @firstName)

snaps & snippets

Monday, February 19, 2007

SQL Server 2005 Service Pack 2

Wednesday, February 14, 2007

SQL Server 2005 XML Methods, Part Three, The XPath Expression

More XPaths than one lead to <Rome/>

Which XPath is the quickest?

Mental aggregation