After a brief introduction to XML retrieval methods in SQL Server 2005 we can examine the behaviour of a few practical examples.
The implementation of XML as a native data type could not have been considered complete without XML indexes. Especially when considering data retrieval – indexing may just as well be the deciding element of the entire implementation. There are four types of XML indexes available: a primary XML index and three secondary XML indexes each of the latter with a specific intended purpose. Please, follow the link above (to Books Online) for all the details.
Originally the subject of this post was supposed to be how XML indexes improve the performance of XML retrieval methods. I say *originally* because after a lot of testing and playing around with different possibilities and different approaches I've come to the conclusion that if XML indexes should have had a positive impact on performance then there must have been something seriously wrong with my sample data, or perhaps I have yet to find the magic combination. Either way, I don't believe finding a solution to a problem as trivial as this really should be this difficult. Well, I'm not ready to give up. Yet.
Roll up the sleeves
First of all, execute this script in your favourite testing database. The script creates two tables and fills them up with XML data, assuming that the AdventureWorks database is accessible from your favourite testing database. Then, to spice things up, add these four functions. Spicy and XPathy.
I assume you always read through each script before executing it. ;)
What's in a table?
When executed the script above will build two tables (dbo.AttributesNameValue
and dbo.AttributesTyped
) and insert two types of XML into them from the Person.Contact
table of the AdventureWorks database (i.e. customer attributes such as FirstName
, MiddleName
and LastName
are encapsulated into a single XML per CustomerID
).
In the script you'll notice the @step
variable; its purpose is to limit the number of rows retrieved from the source table. I've set it to 100,000 which exceeds the actual number of rows, and you can reduce it appropriately – to play around with the size of the test table. I would be very interested in knowing whether anyone comes up with different results.
The difference between the tables is that the XML in the dbo.AttributesNameValue
is based on an untyped (or at least very loosely typed) EAV model and the one in the dbo.AttributesTyped
is typed. Here are the XML Schemas used:
- One for the EAV XML stored in the
dbo.AttributesNameValue.Attribute
table; and
- Another for the Typed XML stored in the
dbo.AttributesTyped.Attribute
table.
I'll start by being mean and ruthless, simply to illustrate what we're up against. The following table shows the performance of the original query against the Person.Contact
table:
Query |
Performance |
Primary Key Only |
Covering Index |
CPU |
Reads |
CPU |
Reads |
#1 |
20 |
561 |
0 |
3 |
Basic |
20 |
561 |
0 |
3 |
Basic |
20 |
561 |
0 |
3 |
Basic Query |
10 |
561 |
0 |
3 |
Basic |
20 |
561 |
0 |
3 |
Basic |
30 |
561 |
0 |
3 |
The performance of query #1 is pretty much what should be expected – no special burden on the engine even without the covering index. Mind you, all 19972 rows reside in 559 pages.
The situation is quite different once we query XML data. The first thing to consider in this case is the increased number of pages: 5136 pages for dbo.AttributesTyped
and 9087 pages for dbo.AttributesNameValue
(the number of pages has grown 9 and 16 times respectively). The other consideration is the fact that now we also need an appropriate XQuery to access the data encapsulated in the XML. I've constructed 10 queries (5 per each table) using 5 different approaches to querying XML (you can find each query in the footnotes of this article). These are the results of each query's performance:
Performance (Primary Key Only) |
Query |
EAV Schema |
|
Query |
Typed Schema |
CPU |
Reads |
|
CPU |
Reads |
#2 |
35193 |
1825 |
|
#7 |
26819 |
966 |
#2 |
34890 |
1833 |
|
#7 |
30504 |
966 |
#2 |
34811 |
1823 |
|
#7 |
26138 |
966 |
#2 |
35591 |
1823 |
|
#7 |
24305 |
966 |
#2 |
35040 |
1823 |
|
#7 |
26278 |
966 |
#2 |
35632 |
1823 |
|
#7 |
26869 |
966 |
#3 |
21723 |
1824 |
|
#8 |
4454 |
966 |
#3 |
20960 |
1829 |
|
#8 |
4967 |
966 |
#3 |
21121 |
1823 |
|
#8 |
4356 |
966 |
#3 |
21782 |
1823 |
|
#8 |
4206 |
966 |
#3 |
22082 |
1823 |
|
#8 |
4366 |
966 |
#3 |
22672 |
1823 |
|
#8 |
4376 |
966 |
#4 |
21565 |
1818 |
|
#9 |
4502 |
960 |
#4 |
21040 |
1823 |
|
#9 |
4847 |
960 |
#4 |
20950 |
1817 |
|
#9 |
4547 |
960 |
#4 |
21701 |
1817 |
|
#9 |
4126 |
960 |
#4 |
22002 |
1817 |
|
#9 |
4426 |
960 |
#4 |
22131 |
1817 |
|
#9 |
4566 |
960 |
#5 |
74417 |
1820 |
|
#10 |
44344 |
962 |
#5 |
72485 |
1823 |
|
#10 |
45816 |
962 |
#5 |
73916 |
1819 |
|
#10 |
43923 |
962 |
#5 |
75068 |
1819 |
|
#10 |
43383 |
962 |
#5 |
75739 |
1819 |
|
#10 |
43964 |
962 |
#5 |
74878 |
1819 |
|
#10 |
44635 |
962 |
#6 |
25449 |
1820 |
|
#11 |
4676 |
962 |
#6 |
24495 |
1825 |
|
#11 |
4897 |
962 |
#6 |
24685 |
1819 |
|
#11 |
4587 |
962 |
#6 |
25517 |
1819 |
|
#11 |
4836 |
962 |
#6 |
26629 |
1819 |
|
#11 |
4616 |
962 |
#6 |
25918 |
1819 |
|
#11 |
4446 |
962 |
The loss of performance is blindingly obvious – a lot more reads are necessary to access the data, of course the problem remains the same: "return the ID and the LastName of the person with the FirstName 'Karen'".
What's in a query?
Allow me to explain what the functions actually do. There are four of them, two per each table (per XML Type).
- One pair of functions uses the
nodes
XML retrieval method (used in queries #2, #3, #7 and #8);
- While the other pair uses the
query
XML retrieval method (used in queries #5, #6, #10 and #11).
Each function accepts two parameters: the XML and the name of the element, and returns the value of the given element.
Queries #4 and #9 are 'function-free', using only the query
XML retrieval method with appropriate XQueries to access the nodes.
What!? No XML indexes!?
Yes, in the first series of tests the only existing indexes were the tables' primary keys. Could that be the reason behind poor performance? Well, let's add those immediately.
Performance (All XML indexes) |
Query |
EAV Schema |
|
Query |
Typed Schema |
CPU |
Reads |
|
CPU |
Reads |
#2 |
34876 |
1823 |
|
#7 |
27355 |
966 |
#2 |
35331 |
1823 |
|
#7 |
27009 |
966 |
#2 |
35100 |
1823 |
|
#7 |
27189 |
966 |
#2 |
34009 |
1823 |
|
#7 |
27550 |
966 |
#2 |
35240 |
1823 |
|
#7 |
27690 |
966 |
#2 |
34700 |
1823 |
|
#7 |
27339 |
966 |
#3 |
26330 |
977379 |
|
#8 |
7693 |
142280 |
#3 |
21110 |
687442 |
|
#8 |
7641 |
142280 |
#3 |
28792 |
1170670 |
|
#8 |
7491 |
142280 |
#3 |
21581 |
687442 |
|
#8 |
7771 |
142280 |
#3 |
30103 |
1170670 |
|
#8 |
7862 |
142280 |
#3 |
30063 |
1170670 |
|
#8 |
7701 |
142280 |
#4 |
22532 |
785248 |
|
#9 |
7765 |
142537 |
#4 |
22863 |
809326 |
|
#9 |
7781 |
142537 |
#4 |
21641 |
809326 |
|
#9 |
7801 |
142537 |
#4 |
21681 |
688934 |
|
#9 |
7671 |
142537 |
#4 |
23464 |
809326 |
|
#9 |
7851 |
142537 |
#4 |
23013 |
809326 |
|
#9 |
7721 |
142537 |
#5 |
74347 |
1819 |
|
#10 |
45836 |
962 |
#5 |
76040 |
1819 |
|
#10 |
45946 |
962 |
#5 |
72023 |
1819 |
|
#10 |
46046 |
962 |
#5 |
75338 |
1819 |
|
#10 |
45745 |
962 |
#5 |
74027 |
1819 |
|
#10 |
45435 |
962 |
#5 |
74307 |
1819 |
|
#10 |
46006 |
962 |
#6 |
23530 |
386552 |
|
#11 |
7919 |
142276 |
#6 |
25576 |
386552 |
|
#11 |
7481 |
142276 |
#6 |
22112 |
386552 |
|
#11 |
8062 |
142276 |
#6 |
23524 |
386552 |
|
#11 |
7982 |
142276 |
#6 |
23143 |
386552 |
|
#11 |
8061 |
142276 |
#6 |
23294 |
386552 |
|
#11 |
8011 |
142276 |
Dreaming of performance improvements? Keep on dreaming. With XML indexes even more reads are necessary to access data, yet – in the case of the EAV type – the CPU overhead is not all that significant compared to the case without XML indexes. The increase in the number of reads is due to the way XML indexing is implemented in SQL Server 2005: additional system tables are created to store the XML index data.
In other words: 419412 additional rows in 2786 pages for dbo.AttributesNameValue
and 179748 additional rows in 1105 pages for dbo.AttributesTyped
.
The number of reads in queries against the dbo.AttributesNameValue
table where XML indexes are actually used (i.e. queries #3, #4 and #6) is 536, 432 and 212 times higher compared to the number of reads when only the table's primary key is used. The good thing is that the impact on the processor remains pretty much unchanged – the queries seem to take just as much time with or without the XML indexes.
There is a significant difference, however, with the queries against the dbo.AttributesTyped
table. Here, the impact on the processor is approximately 1.7 times higher, while the number of reads (for queries #8, #9 and #11) is approximately 148 times higher compared to the number of reads when only the table's primary key is used.
Books Online suggest using the VALUE and the PATH XML indexes for these kinds of queries, but as I have found out (using this particular test data) the only XML index ever used was the PRIMARY XML index (at least none of the other indexes ever showed up on the execution plans); although the performance was best with all XML indexes. In fact, I have experienced the most positive impact on performance (the lowest number of reads) after adding the PROPERTY index, which may have something to do with the nillability of the MiddleName element in both XML types, but I'm not certain. Anyway, according to the execution plans, no secondary XML index was ever used by the optimizer.
Definitely a matter worth exploring further...
Weak or strong?
XML Schema Collections bring proper domain integrity to the XML data type, but is that all they bring? Let's add the XML Schemas for the two XML types, and after the XML Schema Collections have been created we need to drop all XML indexes prior to altering the two XML columns – the rule that indexed columns cannot be altered applies to XML columns as well. Before executing the queries re-create all XML indexes. Now we're ready to see whether this affects the performance of the queries in any way.
Performance (All XML indexes) |
Query |
EAV Schema |
|
Query |
Typed Schema |
CPU |
Reads |
|
CPU |
Reads |
#2 |
43112 |
9115 |
|
#7 |
30424 |
5156 |
#2 |
41290 |
9115 |
|
#7 |
30024 |
5156 |
#2 |
41309 |
9115 |
|
#7 |
30324 |
5156 |
#2 |
41480 |
9115 |
|
#7 |
30654 |
5156 |
#2 |
46056 |
9115 |
|
#7 |
30614 |
5156 |
#2 |
45425 |
9115 |
|
#7 |
30504 |
5156 |
#3 |
14614 |
511056 |
|
#8 |
7220 |
145909 |
#3 |
16469 |
511056 |
|
#8 |
7050 |
145909 |
#3 |
13209 |
511056 |
|
#8 |
7200 |
145909 |
#3 |
13329 |
511056 |
|
#8 |
7411 |
145909 |
#3 |
14911 |
511056 |
|
#8 |
7110 |
145909 |
#3 |
15152 |
511056 |
|
#8 |
7330 |
145909 |
#4 |
13794 |
512012 |
|
#9 |
7096 |
146163 |
#4 |
13299 |
511980 |
|
#9 |
7060 |
146163 |
#4 |
13038 |
511980 |
|
#9 |
7241 |
146163 |
#4 |
13128 |
511980 |
|
#9 |
7050 |
146163 |
#4 |
15012 |
511980 |
|
#9 |
7030 |
146163 |
#4 |
14491 |
512140 |
|
#9 |
7101 |
146163 |
#5 |
77025 |
9111 |
|
#10 |
45956 |
5152 |
#5 |
74747 |
9111 |
|
#10 |
45816 |
5152 |
#5 |
74387 |
9111 |
|
#10 |
46246 |
5152 |
#5 |
74488 |
9111 |
|
#10 |
46517 |
5152 |
#5 |
82569 |
9111 |
|
#10 |
45906 |
5152 |
#5 |
78934 |
9111 |
|
#10 |
45295 |
5152 |
#6 |
16732 |
309900 |
|
#11 |
7241 |
145905 |
#6 |
16544 |
309900 |
|
#11 |
7110 |
145905 |
#6 |
16413 |
309900 |
|
#11 |
7341 |
145905 |
#6 |
16434 |
309900 |
|
#11 |
7401 |
145905 |
#6 |
17775 |
309900 |
|
#11 |
7111 |
145905 |
#6 |
16494 |
309900 |
|
#11 |
7240 |
145905 |
It seems that typing the XML enables the SQL Server engine to organize the XML indexes more efficiently, as a consequence the number of logical reads is reduced – that is, for queries where XML indexes are used (i.e. queries #3, #4, #6, #8, #9 and #11). While the number of logical reads is increased for the queries where XML indexes aren't used (i.e. queries #2, #5, #7 and #10).
Actually, both the number of logical reads as well as the stress on the CPU have dropped dramatically for the case of the EAV XML type, while the number of logical reads has increased slightly for the strongly typed XML, but at the same time the stress on the CPU has decreased. Apparently, typing does improve performance of retrieval methods. Again, this is a matter worthy of more detailed investigation.
Mental aggregation
We've compared the performance aspects of a few typical use-cases for the built-in XML retrieval methods. So, what is the pattern that we've experienced? Regarding XML in SQL Server several points have become apparent:
- Queries against the XML data type are by far more resource-intensive than queries against data properly normalised in SQL;
- XML indexes are very much different and behave quite differently from regular indexes: many more reads are required to access data;
- SQL Server 2005 seems to "prefer" typed XML; nonetheless, the performance of queries against the XML data type will diminish as the complexity of the XML data increases;
- We've seen proof once again of why functions on data columns in query restrictions should be avoided – with user-defined functions in the
WHERE
clause the performance was really very poor. However, the overall performance of queries *could* benefit from functions being used in the SELECT
clause (as is the case with queries #3 and #6 vs. query #4 or queries #8 and #11 vs. #9), but this should never be taken as a general rule – it merely represents an alternative to consider.
I would have expected XML Queries to perform less efficiently than SQL queries, but judging from my test data the drop in performance can be quite dramatic. To be honest, I also did not expect XML Schema Collections to have a positive impact on retrieval performance; quite the opposite!
In one of the following posts I'll be focusing some more on XML index use; most importantly – I'll be discussing the execution plans which we haven't looked at yet. After all, this was supposed to be only a slightly more detailed introduction into the world of XML retrieval methods introduced with SQL Server 2005. Next up is a look at XPath expressions.
ML
- Query #1:
select Person.Contact.ContactID as ContactID
,Person.Contact.LastName as LastName
from Person.Contact
where (Person.Contact.FirstName = 'Karen')
- The covering index for the
Person.Contact
table:
create nonclustered index x_Person_Contact_FullName
on Person.Contact
(
FirstName
,MiddleName
,LastName
)
- Query #2:
select dbo.AttributesNameValue.AttributeId as AttributeId
,dbo.fnGet_Value_byAttributeName_fromNameValue(
dbo.AttributesNameValue.Attribute, 'lastName'
) as LastName
from dbo.AttributesNameValue
where (dbo.fnGet_Value_byAttributeName_fromNameValue(
dbo.AttributesNameValue.Attribute
,'firstName'
) = 'Karen')
- Query #3:
select ANV.AttributeId as AttributeId
,dbo.fnGet_Value_byAttributeName_fromNameValue(
ANV.Attribute
,'lastName'
) as LastName
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query(
'attributeCollection/attribute[name = "firstName"]/value'
).value(
'.'
,'varchar(max)'
) = 'Karen')
- Query #4:
select ANV.AttributeId as AttributeId
,ANV.Attribute.query(
'attributeCollection/attribute[name = "lastName"]/value'
).value(
'.'
,'varchar(max)'
) as LastName
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query(
'attributeCollection/attribute[name = "firstName"]/value'
).value(
'.'
,'varchar(max)'
) = 'Karen')
- Query #5:
select ANV.AttributeId as AttributeId
,dbo.fnGet_Value_byAttributeName_fromNameValue_Query(
ANV.Attribute
,'lastName'
) as LastName
from dbo.AttributesNameValue ANV
where (dbo.fnGet_Value_byAttributeName_fromNameValue_Query(
ANV.Attribute
,'firstName'
) = 'Karen')
- Query #6:
select ANV.AttributeId as AttributeId
,dbo.fnGet_Value_byAttributeName_fromNameValue_Query(
ANV.Attribute
,'lastName'
) as LastName
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query(
'/attributeCollection/attribute[name = "firstName"][1]/value'
).value(
'.'
,'varchar(max)'
) = 'Karen')
- Query #7:
select dbo.AttributesTyped.AttributeId as AttributeId
,dbo.fnGet_Value_byAttributeName_fromTyped(
dbo.AttributesTyped.Attribute
,'lastName'
) as LastName
from dbo.AttributesTyped
where (dbo.fnGet_Value_byAttributeName_fromTyped(
dbo.AttributesTyped.Attribute
,'firstName'
) = 'Karen')
- Query #8:
select AT.AttributeId as AttributeId
,dbo.fnGet_Value_byAttributeName_fromTyped(
AT.Attribute
,'lastName'
) as LastName
from dbo.AttributesTyped AT
where (AT.Attribute.query(
'attributeCollection/firstName'
).value(
'.'
,'varchar(max)'
) = 'Karen')
- Query #9:
select AT.AttributeId as AttributeId
,AT.Attribute.query(
'attributeCollection/lastName'
).value(
'.'
,'varchar(max)'
) as LastName
from dbo.AttributesTyped AT
where (AT.Attribute.query(
'attributeCollection/firstName'
).value(
'.'
,'varchar(max)'
) = 'Karen')
- Query #10:
select AT.AttributeId as AttributeId
,dbo.fnGet_Value_byAttributeName_fromTyped_Query(
AT.Attribute
,'lastName'
) as LastName
from dbo.AttributesTyped AT
where (dbo.fnGet_Value_byAttributeName_fromTyped_Query(
AT.Attribute
,'firstName'
) = 'Karen')
- Query #11:
select AT.AttributeId as AttributeId
,dbo.fnGet_Value_byAttributeName_fromTyped_Query(
AT.Attribute
,'lastName'
) as LastName
from dbo.AttributesTyped AT
where (AT.Attribute.query(
'attributeCollection/firstName'
).value(
'.'
,'varchar(max)'
) = 'Karen')