We've discussed the basics of XML in SQL, we've seen a few examples of XQueries and their performance; I guess we're ready to go a bit deeper. In this article we'll look more closely at XPath expressions and see how they behave in SQL Server 2005 XML Queries.
More XPaths than one lead to <Rome/>
First of all there are several ways of formulating a valid XPath expression – for a comprehensive overview of the subject please consult the W3C Recommendations on XPath (mentioned in previous posts) and other sources (e.g. the XPath Tutorial at W3Schools Online Web Tutorials).
Generally, we have two options (with examples later in this article):
- The exact path – where either the full path (starting at the root node) to a particular node (be it an element or an attribute) is specified (a.k.a. the absolute path) or the partial path to a particular node is specified depending on the current context (a.k.a. the relative path); or
- The approximate path – where the path to the node is specified using XPath wildcards (e.g. "
//
" – anywhere, "element::*
" or "*
" – any element, "attribute::*
" or "@*
" – any attribute;1 of course more on the subject can be found in the W3C Recommendations). Although wildcards can also be used as part of the absolute or the relative path, such XPath expressions cannot be considered exact.
Specifying the exact path is generally the fastest, but it requires quite an intimate knowledge of the XML in question, and may require existing XPath expressions (e.g. used in T-SQL modules) to be corrected if the XML structure is subsequently modified for whatever reason. Approximate paths, on the other hand, are less structure-dependent, but they suffer from a higher performance cost, especially in complex structures.
Of course neither of the two options may lead to the desired node if more nodes exist in a given XML that correspond to a particular XPath expression. It is therefore necessary to extend the expression in such cases with an appropriate XPath predicate.
If we were to compare XPath/XQuery expressions with T-SQL queries we could say that the XPath expression corresponds to the T-SQL SELECT
and FROM
clauses and the XPath predicate functions as the T-SQL WHERE
clause. An appropriate XPath predicate is practically indispensable in situations where the XPath expression should point to a specific node – e.g. if a singleton is needed (as is the case with the value()
XML method; we've seen such a case before).
For instance, using the following XML:
declare @example xml
set @example = N'<document>
<title>Hamsters prefer pumpkin seeds</title>
<paragraph>George B. Quick is a golden hamster.</paragraph>
<paragraph>He''s quick when curious about his surroundings and even quicker
when hungry.</paragraph>
<paragraph>Although he''s presented with a wide variety of tasty and
nutritious goodies each day, George prefers shiny pumpkin seeds.
</paragraph>
</document>'
...all the following queries return the value of the title
element using three different XPath expressions:
-- Exact (absolute) path
select @example.query(
'/document/title'
).value(
'.'
,'varchar(max)'
) as Title
-- Approximate path using the "anywhere" wildcard
select @example.query(
'//title'
).value(
'.'
,'varchar(max)'
) as Title
-- Approximate path using the "any element" wildcard
select @example.query(
'/*/title'
).value(
'.'
,'varchar(max)'
) as Title
No predicates are necessary in this particular example as there is only a single title
element in our XML. However, if we wanted to extract the value of a particular paragraph
element we might end up with unexpected results:
select @example.query(
'/document/paragraph'
).value(
'.'
,'varchar(max)'
) as Paragraph
select @example.query(
'//paragraph'
).value(
'.'
,'varchar(max)'
) as Paragraph
select @example.query(
'/*/paragraph'
).value(
'.'
,'varchar(max)'
) as Paragraph
All three queries return a concatenation of values from all paragraph
elements. While this may be valid for some it may not be valid for all queries. What can we do? Well, with an appropriate predicate we can restrict the XML Query result. For instance, we can qualify the XPath expression using the paragraph
element's position:
-- Extract the value of the second paragraph...
-- ...using the full syntax for the predicate:
select @example.query(
'//paragraph[position() = 2]'
).value(
'.'
,'varchar(max)'
) as Paragraph
-- ...using shorthand:
select @example.query(
'//paragraph[2]'
).value(
'.'
,'varchar(max)'
) as Paragraph
Although this method is frequently used in printed and online references on the subject of querying XML data (including Books Online!) I strongly advice against it, as it serves as a doorway to a whole world of possible problems. According to the XML specification the original order of elements need not be preserved (if several elements of the same name are allowed in a given XML) – meaning that it is perfectly legal to change element positions, to insert additional elements anywhere inside the existing element island(s), etc. Even XML Schema won't prevent this.
Being presented with an unfortunate XML as the one in our example there is little we can do in XML queries to prevent unexpected or incorrect results due to changes in the XML. If, on the other hand, the XML structure includes a means of uniquely identifying individual nodes, we can use more resilient XPath expressions:
declare @example xml
set @example = N'<document>
<title>Hamsters prefer pumpkin seeds</title>
<paragraph id="1">George B. Quick is a golden hamster.</paragraph>
<paragraph id="2">He''s quick when curious about his surroundings and even
quicker when hungry.</paragraph>
<paragraph id="3">Although he''s presented with a wide variety of tasty and
nutritious goodies each day, George prefers shiny pumpkin seeds.
</paragraph>
</document>'
-- Exact (absolute) path using the appropriate predicate
select @example.query(
'/document/paragraph[@id = 2]'
).value(
'.'
,'varchar(max)'
) as Paragraph
-- Approximate path using the "anywhere" wildcard and the appropriate predicate
select @example.query(
'//paragraph[@id = 2]'
).value(
'.'
,'varchar(max)'
) as Paragraph
-- Approximate path using the "any element" wildcard and the appropriate predicate
select @example.query(
'//*[@id = 2]'
).value(
'.'
,'varchar(max)'
) as Paragraph
In the example above we've modified the XML by adding attributes that uniquely identify each repeated element – this way we can use XPath expressions that make the query no longer depend on element positions, but rather on more reliable properties. Of course this is not the only method that can be used nor may it be the most appropriate in all situations. We can quite clearly see that data integrity does start in the XML, and typing the XML (e.g. with XML Schema) is only one of the many steps we have to take to maintain it.
Undoubtedly, data integrity is important, but we should not neglect the performance of data retrieval.
Which XPath is the quickest?
In our performance test2 we will focus on the use of XML retrieval methods in query conditions – i.e. a comparison of various XPath expressions in the WHERE
clause and their effect on performance. The queries used in this test are listed in footnotes at the bottom of this article. Please, refer to Part Two of the series for DDL and sample data. Note that both tables are XML-indexed and appropriate XML schema collections are used to type XML columns.
Performance (All XML indexes) |
Query |
EAV Schema |
|
Query |
Typed Schema |
CPU |
Reads |
|
CPU |
Reads |
#1 |
29260 |
1180945 |
|
#7 |
11837 |
403710 |
#1 |
25847 |
1181142 |
|
#7 |
10846 |
403706 |
#1 |
28161 |
1180896 |
|
#7 |
11206 |
403706 |
#1 |
30845 |
1180896 |
|
#7 |
12378 |
403706 |
#1 |
31225 |
1180896 |
|
#7 |
12839 |
403724 |
#1 |
30224 |
1180896 |
|
#7 |
11917 |
403706 |
#2 |
29929 |
1180896 |
|
#8 |
13175 |
456104 |
#2 |
28351 |
1180896 |
|
#8 |
13680 |
456104 |
#2 |
26628 |
1180896 |
|
#8 |
12087 |
456104 |
#2 |
31945 |
1180896 |
|
#8 |
13439 |
456104 |
#2 |
31946 |
1180896 |
|
#8 |
13519 |
456104 |
#2 |
30774 |
1180896 |
|
#8 |
13149 |
456104 |
#3 |
28148 |
884298 |
|
#9 |
18164 |
764994 |
#3 |
27369 |
884298 |
|
#9 |
17996 |
764992 |
#3 |
27560 |
884298 |
|
#9 |
18397 |
765004 |
#3 |
27880 |
884298 |
|
#9 |
17876 |
764992 |
#3 |
29873 |
884298 |
|
#9 |
18807 |
764992 |
#3 |
28060 |
884298 |
|
#9 |
17746 |
764992 |
As we've already seen in previous articles the performance of XML queries in comparison to T-SQL queries is quite poor. However, it should be apparent now that we can actually improve the performance of XML retrieval simply by using more appropriate XPath expressions.
For the EAV XML type approximate XPath expressions seem to be less resource-intensive, while for the strongly typed XML using an exact path will yield better performance. Of course, once again it seems that SQL Server prefers typed XML.
Anyway, we're not quite done yet. Believe it or not, there still remains some room for improvement. And it won't require any changes to the XML or the XML Schema – the XPath expressions can be further improved. How about halving CPU time and decimating reads?
Performance (All XML indexes) |
Query |
EAV Schema |
|
Query |
Typed Schema |
CPU |
Reads |
|
CPU |
Reads |
#4 |
8995 |
148077 |
|
#10 |
5670 |
87871 |
#4 |
8922 |
148077 |
|
#10 |
5648 |
87871 |
#4 |
8823 |
148077 |
|
#10 |
5879 |
87871 |
#4 |
9293 |
148077 |
|
#10 |
5678 |
87871 |
#4 |
8893 |
148077 |
|
#10 |
5487 |
87871 |
#4 |
9043 |
148077 |
|
#10 |
5658 |
87871 |
#5 |
9035 |
148077 |
|
#11 |
5740 |
87871 |
#5 |
8993 |
148077 |
|
#11 |
5598 |
87871 |
#5 |
9073 |
148077 |
|
#11 |
5668 |
87871 |
#5 |
8893 |
148077 |
|
#11 |
5778 |
87871 |
#5 |
9143 |
148077 |
|
#11 |
5749 |
87871 |
#5 |
9073 |
148077 |
|
#11 |
5909 |
87871 |
#6 |
12332 |
228022 |
|
#12 |
10497 |
242122 |
#6 |
12178 |
228031 |
|
#12 |
10115 |
242122 |
#6 |
12087 |
228019 |
|
#12 |
10084 |
242122 |
#6 |
12388 |
228023 |
|
#12 |
10726 |
242122 |
#6 |
12518 |
228019 |
|
#12 |
10835 |
242122 |
#6 |
12488 |
228019 |
|
#12 |
10725 |
242122 |
By using even more exact XPath expressions we can reduce CPU time by 2 to 3 times and reduce the number of reads by 3 to 8 times! This *is* without a doubt a significant amount.
But where's the catch?
Well, where does the value of a simple (or mixed) XML element actually reside? In its text
node, that's where. It is therefore logical to look for it right there and nowhere else, and apparently this is exactly how XML data retrieval works in SQL: by specifying an XPath expression that is as exact as possible we allow the database engine to navigate past other possible locations (e.g. child nodes) and go directly to the intended target (the bull's-eye).
Even though our target element holds no descendants (and – according to XML schema collections being used – isn't allowed to hold any) the "/attributeCollection/firstName
" XPath expression actually points to the entire element (i.e. a node-set, not just a node!), while the "/attributeCollection/firstName/text()
" XPath expression points to a single final node. Well, this is what I've observed – after all, I haven't designed the darn thing. ;) It makes perfect sense, though.
Mental aggregation
What have we learned?
- First of all, serious XML use starts with serious XML modeling: most of the considerations related to an efficient SQL data model are necessary for efficient XML models as well. Use the XML Schema to provide domain integrity (XDR, although supported by SQL Server 2005, is fast becoming obsolete), use appropriate relational modeling (yes, relational!) inside your XML Schema, and consider all relevant aspects of XML use *before* you introduce your XML model into your SQL model;
- Second, know your XML: XML Queries should reflect the expected structure of XML data – consider the XML Schema as your guide through the XML. Can you afford to use incorrect data?
- Third, really know your XML: the difference between exact and approximate XPath expressions may be insignificant on small quantities of data, but may grow dramatically on large quantities. SQL Server prefers the exact XPath and so should you;
- And last but not least, if your particular business case requires an XML model that can also be modeled in SQL, use SQL instead; especially so if a SQL pattern or practice that matches your requirements already exists. After all, the superiority of SQL over XML regarding data modeling, storage and retrieval remains undisputed.
Next time we'll take a look at something rather unexpected and quite useful (in some respects).
ML
- 1 Quotation marks are not part of the XPath expression.
- 2 You'll need this for the rest of the queries to work (in addition to DDL and sample data available here):
declare @firstName varchar(8)
set @firstName = 'Karen'
- Query #1 (EAV Type, absolute path):
select ANV.AttributeId as AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query(
'/attributeCollection/attribute[name = "firstName"]/value'
).value(
'.'
,'varchar(max)'
) = @firstName)
- Query #2 (EAV Type, approximate path):
select ANV.AttributeId as AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query(
'//attribute[name = "firstName"]/value'
).value(
'.'
,'varchar(max)'
) = @firstName)
- Query #3 (EAV Type, an "even more approximate" path):
select ANV.AttributeId as AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query(
'//value[../name = "firstName"]'
).value(
'.'
,'varchar(max)'
) = @firstName)
- Query #4 (EAV Type, the "bull's-eye" path):
select ANV.AttributeId as AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query(
'/attributeCollection/attribute[name/text() = "firstName"]/value/text()'
).value(
'.'
,'varchar(max)'
) = @firstName)
- Query #5 (EAV Type, the "near bull's-eye" path):
select ANV.AttributeId as AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query(
'//attribute[name/text() = "firstName"]/value/text()'
).value(
'.'
,'varchar(max)'
) = @firstName)
- Query #6 (EAV Type, the "in the neighbourhood of the bull's-eye" path):
select ANV.AttributeId as AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query(
'//value[../name/text() = "firstName"]/text()'
).value(
'.'
,'varchar(max)'
) = @firstName)
- Query #7 (Strong Type, absolute path):
select AT.AttributeId as AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.query(
'/attributeCollection/firstName'
).value(
'.'
,'varchar(max)'
) = @firstName)
- Query #8 (Strong Type, approximate path):
select AT.AttributeId as AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.query(
'//firstName'
).value(
'.'
,'varchar(max)'
) = @firstName)
- Query #9 (Strong Path, the "ridiculously complicated" path):
select AT.AttributeId as AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.query(
'//*[local-name() = "firstName"]'
).value(
'.'
,'varchar(max)'
) = @firstName)
- Query #10 (Strong Type, the "bulls's-eye" path):
select AT.AttributeId as AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.query(
'/attributeCollection/firstName/text()'
).value(
'.'
,'varchar(max)'
) = @firstName)
- Query #11 (Strong Type, the "near bull's-eye" path):
select AT.AttributeId as AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.query(
'//firstName/text()'
).value(
'.'
,'varchar(max)'
) = @firstName)
- Query #12 (Strong Type, the "grasping at straws" path):
select AT.AttributeId as AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.query(
'//*[local-name() = "firstName"]/text()'
).value(
'.'
,'varchar(max)'
) = @firstName)