Monday, February 19, 2007

SQL Server 2005 Service Pack 2

After the December 2006 CTP of SQL Server 2005 SP2 the second Service Pack has RTM'ed today and is ready for download at the official site.

The official site contains links to individual downloads, here are a few essential short cuts:

The build number of SP2 is 3042 (the full version number is therefore 9.00.3042).

I guess this means that Windows Vista installations can finally be improved with a state-of-the-art RDBMS product. ;)


ML

Wednesday, February 14, 2007

SQL Server 2005 XML Methods, Part Three, The XPath Expression

We've discussed the basics of XML in SQL, we've seen a few examples of XQueries and their performance; I guess we're ready to go a bit deeper. In this article we'll look more closely at XPath expressions and see how they behave in SQL Server 2005 XML Queries.


More XPaths than one lead to <Rome/>

First of all there are several ways of formulating a valid XPath expression – for a comprehensive overview of the subject please consult the W3C Recommendations on XPath (mentioned in previous posts) and other sources (e.g. the XPath Tutorial at W3Schools Online Web Tutorials).

Generally, we have two options (with examples later in this article):

  • The exact path – where either the full path (starting at the root node) to a particular node (be it an element or an attribute) is specified (a.k.a. the absolute path) or the partial path to a particular node is specified depending on the current context (a.k.a. the relative path); or
  • The approximate path – where the path to the node is specified using XPath wildcards (e.g. "//" – anywhere, "element::*" or "*" – any element, "attribute::*" or "@*" – any attribute;1 of course more on the subject can be found in the W3C Recommendations). Although wildcards can also be used as part of the absolute or the relative path, such XPath expressions cannot be considered exact.

Specifying the exact path is generally the fastest, but it requires quite an intimate knowledge of the XML in question, and may require existing XPath expressions (e.g. used in T-SQL modules) to be corrected if the XML structure is subsequently modified for whatever reason. Approximate paths, on the other hand, are less structure-dependent, but they suffer from a higher performance cost, especially in complex structures.

Of course neither of the two options may lead to the desired node if more nodes exist in a given XML that correspond to a particular XPath expression. It is therefore necessary to extend the expression in such cases with an appropriate XPath predicate.

If we were to compare XPath/XQuery expressions with T-SQL queries we could say that the XPath expression corresponds to the T-SQL SELECT and FROM clauses and the XPath predicate functions as the T-SQL WHERE clause. An appropriate XPath predicate is practically indispensable in situations where the XPath expression should point to a specific node – e.g. if a singleton is needed (as is the case with the value() XML method; we've seen such a case before).

For instance, using the following XML:

declare @example xml

set @example = N'<document>
 <title>Hamsters prefer pumpkin seeds</title>
 <paragraph>George B. Quick is a golden hamster.</paragraph>
 <paragraph>He''s quick when curious about his surroundings and even quicker
  when hungry.</paragraph>
 <paragraph>Although he''s presented with a wide variety of tasty and
  nutritious goodies each day, George prefers shiny pumpkin seeds.
  </paragraph>
</document>'

...all the following queries return the value of the title element using three different XPath expressions:

-- Exact (absolute) path
select @example.query(
   '/document/title'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Title

-- Approximate path using the "anywhere" wildcard
select @example.query(
   '//title'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Title

-- Approximate path using the "any element" wildcard
select @example.query(
   '/*/title'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Title

No predicates are necessary in this particular example as there is only a single title element in our XML. However, if we wanted to extract the value of a particular paragraph element we might end up with unexpected results:

select @example.query(
   '/document/paragraph'
    ).value(
     '.'
     ,'varchar(max)'
     ) as Paragraph

select @example.query(
   '//paragraph'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Paragraph

select @example.query(
   '/*/paragraph'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Paragraph

All three queries return a concatenation of values from all paragraph elements. While this may be valid for some it may not be valid for all queries. What can we do? Well, with an appropriate predicate we can restrict the XML Query result. For instance, we can qualify the XPath expression using the paragraph element's position:

-- Extract the value of the second paragraph...

-- ...using the full syntax for the predicate:
select @example.query(
   '//paragraph[position() = 2]'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Paragraph

-- ...using shorthand:
select @example.query(
   '//paragraph[2]'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Paragraph

Although this method is frequently used in printed and online references on the subject of querying XML data (including Books Online!) I strongly advice against it, as it serves as a doorway to a whole world of possible problems. According to the XML specification the original order of elements need not be preserved (if several elements of the same name are allowed in a given XML) – meaning that it is perfectly legal to change element positions, to insert additional elements anywhere inside the existing element island(s), etc. Even XML Schema won't prevent this.

Being presented with an unfortunate XML as the one in our example there is little we can do in XML queries to prevent unexpected or incorrect results due to changes in the XML. If, on the other hand, the XML structure includes a means of uniquely identifying individual nodes, we can use more resilient XPath expressions:

declare @example xml

set @example = N'<document>
 <title>Hamsters prefer pumpkin seeds</title>
 <paragraph id="1">George B. Quick is a golden hamster.</paragraph>
 <paragraph id="2">He''s quick when curious about his surroundings and even
  quicker when hungry.</paragraph>
 <paragraph id="3">Although he''s presented with a wide variety of tasty and
  nutritious goodies each day, George prefers shiny pumpkin seeds.
  </paragraph>
</document>'

-- Exact (absolute) path using the appropriate predicate
select @example.query(
   '/document/paragraph[@id = 2]'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Paragraph

-- Approximate path using the "anywhere" wildcard and the appropriate predicate
select @example.query(
   '//paragraph[@id = 2]'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Paragraph

-- Approximate path using the "any element" wildcard and the appropriate predicate
select @example.query(
   '//*[@id = 2]'
   ).value(
    '.'
    ,'varchar(max)'
    ) as Paragraph

In the example above we've modified the XML by adding attributes that uniquely identify each repeated element – this way we can use XPath expressions that make the query no longer depend on element positions, but rather on more reliable properties. Of course this is not the only method that can be used nor may it be the most appropriate in all situations. We can quite clearly see that data integrity does start in the XML, and typing the XML (e.g. with XML Schema) is only one of the many steps we have to take to maintain it.

Undoubtedly, data integrity is important, but we should not neglect the performance of data retrieval.


Which XPath is the quickest?

In our performance test2 we will focus on the use of XML retrieval methods in query conditions – i.e. a comparison of various XPath expressions in the WHERE clause and their effect on performance. The queries used in this test are listed in footnotes at the bottom of this article. Please, refer to Part Two of the series for DDL and sample data. Note that both tables are XML-indexed and appropriate XML schema collections are used to type XML columns.

Performance
(All XML indexes)
Query EAV Schema Query Typed Schema
CPU Reads CPU Reads
#1 29260 1180945 #7 11837 403710
#2 29929 1180896 #8 13175 456104
#3 28148 884298 #9 18164 764994

As we've already seen in previous articles the performance of XML queries in comparison to T-SQL queries is quite poor. However, it should be apparent now that we can actually improve the performance of XML retrieval simply by using more appropriate XPath expressions.

For the EAV XML type approximate XPath expressions seem to be less resource-intensive, while for the strongly typed XML using an exact path will yield better performance. Of course, once again it seems that SQL Server prefers typed XML.

Anyway, we're not quite done yet. Believe it or not, there still remains some room for improvement. And it won't require any changes to the XML or the XML Schema – the XPath expressions can be further improved. How about halving CPU time and decimating reads?

Performance
(All XML indexes)
Query EAV Schema Query Typed Schema
CPU Reads CPU Reads
#4 8995 148077 #10 5670 87871
#5 9035 148077 #11 5740 87871
#6 12332 228022 #12 10497 242122

By using even more exact XPath expressions we can reduce CPU time by 2 to 3 times and reduce the number of reads by 3 to 8 times! This *is* without a doubt a significant amount.

But where's the catch?

Well, where does the value of a simple (or mixed) XML element actually reside? In its text node, that's where. It is therefore logical to look for it right there and nowhere else, and apparently this is exactly how XML data retrieval works in SQL: by specifying an XPath expression that is as exact as possible we allow the database engine to navigate past other possible locations (e.g. child nodes) and go directly to the intended target (the bull's-eye).

Even though our target element holds no descendants (and – according to XML schema collections being used – isn't allowed to hold any) the "/attributeCollection/firstName" XPath expression actually points to the entire element (i.e. a node-set, not just a node!), while the "/attributeCollection/firstName/text()" XPath expression points to a single final node. Well, this is what I've observed – after all, I haven't designed the darn thing. ;) It makes perfect sense, though.


Mental aggregation

What have we learned?

  1. First of all, serious XML use starts with serious XML modeling: most of the considerations related to an efficient SQL data model are necessary for efficient XML models as well. Use the XML Schema to provide domain integrity (XDR, although supported by SQL Server 2005, is fast becoming obsolete), use appropriate relational modeling (yes, relational!) inside your XML Schema, and consider all relevant aspects of XML use *before* you introduce your XML model into your SQL model;
  2. Second, know your XML: XML Queries should reflect the expected structure of XML data – consider the XML Schema as your guide through the XML. Can you afford to use incorrect data?
  3. Third, really know your XML: the difference between exact and approximate XPath expressions may be insignificant on small quantities of data, but may grow dramatically on large quantities. SQL Server prefers the exact XPath and so should you;
  4. And last but not least, if your particular business case requires an XML model that can also be modeled in SQL, use SQL instead; especially so if a SQL pattern or practice that matches your requirements already exists. After all, the superiority of SQL over XML regarding data modeling, storage and retrieval remains undisputed.

Next time we'll take a look at something rather unexpected and quite useful (in some respects).


ML


  • 1 Quotation marks are not part of the XPath expression.
  • 2 You'll need this for the rest of the queries to work (in addition to DDL and sample data available here):
    declare @firstName varchar(8)
    set @firstName = 'Karen'
  • Query #1 (EAV Type, absolute path):
    select ANV.AttributeId as AttributeId
     from dbo.AttributesNameValue ANV
     where (ANV.Attribute.query(
        '/attributeCollection/attribute[name = "firstName"]/value'
        ).value(
         '.'
         ,'varchar(max)'
         ) = @firstName)
  • Query #2 (EAV Type, approximate path):
    select ANV.AttributeId as AttributeId
     from dbo.AttributesNameValue ANV
     where (ANV.Attribute.query(
        '//attribute[name = "firstName"]/value'
        ).value(
         '.'
         ,'varchar(max)'
         ) = @firstName)
  • Query #3 (EAV Type, an "even more approximate" path):
    select ANV.AttributeId as AttributeId
     from dbo.AttributesNameValue ANV
     where (ANV.Attribute.query(
        '//value[../name = "firstName"]'
        ).value(
         '.'
         ,'varchar(max)'
         ) = @firstName)
  • Query #4 (EAV Type, the "bull's-eye" path):
    select ANV.AttributeId as AttributeId
     from dbo.AttributesNameValue ANV
     where (ANV.Attribute.query(
        '/attributeCollection/attribute[name/text() = "firstName"]/value/text()'
        ).value(
         '.'
         ,'varchar(max)'
         ) = @firstName)
  • Query #5 (EAV Type, the "near bull's-eye" path):
    select ANV.AttributeId as AttributeId
     from dbo.AttributesNameValue ANV
     where (ANV.Attribute.query(
        '//attribute[name/text() = "firstName"]/value/text()'
        ).value(
         '.'
         ,'varchar(max)'
         ) = @firstName)
  • Query #6 (EAV Type, the "in the neighbourhood of the bull's-eye" path):
    select ANV.AttributeId as AttributeId
     from dbo.AttributesNameValue ANV
     where (ANV.Attribute.query(
        '//value[../name/text() = "firstName"]/text()'
        ).value(
         '.'
         ,'varchar(max)'
         ) = @firstName)
  • Query #7 (Strong Type, absolute path):
    select AT.AttributeId as AttributeId
     from dbo.AttributesTyped AT
     where (AT.Attribute.query(
        '/attributeCollection/firstName'
        ).value(
         '.'
         ,'varchar(max)'
         ) = @firstName)
  • Query #8 (Strong Type, approximate path):
    select AT.AttributeId as AttributeId
     from dbo.AttributesTyped AT
     where (AT.Attribute.query(
        '//firstName'
        ).value(
         '.'
         ,'varchar(max)'
         ) = @firstName)
  • Query #9 (Strong Path, the "ridiculously complicated" path):
    select AT.AttributeId as AttributeId
     from dbo.AttributesTyped AT
     where (AT.Attribute.query(
        '//*[local-name() = "firstName"]'
        ).value(
         '.'
         ,'varchar(max)'
         ) = @firstName)
  • Query #10 (Strong Type, the "bulls's-eye" path):
    select AT.AttributeId as AttributeId
     from dbo.AttributesTyped AT
     where (AT.Attribute.query(
        '/attributeCollection/firstName/text()'
        ).value(
         '.'
         ,'varchar(max)'
         ) = @firstName)
  • Query #11 (Strong Type, the "near bull's-eye" path):
    select AT.AttributeId as AttributeId
     from dbo.AttributesTyped AT
     where (AT.Attribute.query(
        '//firstName/text()'
        ).value(
         '.'
         ,'varchar(max)'
         ) = @firstName)
  • Query #12 (Strong Type, the "grasping at straws" path):
    select AT.AttributeId as AttributeId
     from dbo.AttributesTyped AT
     where (AT.Attribute.query(
        '//*[local-name() = "firstName"]/text()'
        ).value(
         '.'
         ,'varchar(max)'
         ) = @firstName)