snaps & snippets: SQL Server 2005 XML Methods, Part Two, Retrieval Comparison

After a brief introduction to XML retrieval methods in SQL Server 2005 we can examine the behaviour of a few practical examples.

The implementation of XML as a native data type could not have been considered complete without XML indexes. Especially when considering data retrieval – indexing may just as well be the deciding element of the entire implementation. There are four types of XML indexes available: a primary XML index and three secondary XML indexes each of the latter with a specific intended purpose. Please, follow the link above (to Books Online) for all the details.

Originally the subject of this post was supposed to be how XML indexes improve the performance of XML retrieval methods. I say *originally* because after a lot of testing and playing around with different possibilities and different approaches I've come to the conclusion that if XML indexes should have had a positive impact on performance then there must have been something seriously wrong with my sample data, or perhaps I have yet to find the magic combination. Either way, I don't believe finding a solution to a problem as trivial as this really should be this difficult. Well, I'm not ready to give up. Yet.

Roll up the sleeves

First of all, execute this script in your favourite testing database. The script creates two tables and fills them up with XML data, assuming that the AdventureWorks database is accessible from your favourite testing database. Then, to spice things up, add these four functions. Spicy and XPathy.

I assume you always read through each script before executing it. ;)

What's in a table?

When executed the script above will build two tables (dbo.AttributesNameValue and dbo.AttributesTyped) and insert two types of XML into them from the Person.Contact table of the AdventureWorks database (i.e. customer attributes such as FirstName, MiddleName and LastName are encapsulated into a single XML per CustomerID).

In the script you'll notice the @step variable; its purpose is to limit the number of rows retrieved from the source table. I've set it to 100,000 which exceeds the actual number of rows, and you can reduce it appropriately – to play around with the size of the test table. I would be very interested in knowing whether anyone comes up with different results.

The difference between the tables is that the XML in the dbo.AttributesNameValue is based on an untyped (or at least very loosely typed) EAV model and the one in the dbo.AttributesTyped is typed. Here are the XML Schemas used:

One for the EAV XML stored in the dbo.AttributesNameValue.Attribute table; and
Another for the Typed XML stored in the dbo.AttributesTyped.Attribute table.

I'll start by being mean and ruthless, simply to illustrate what we're up against. The following table shows the performance of the original query against the Person.Contact table:

Query	Performance
	Primary Key Only		Covering Index
	CPU	Reads	CPU	Reads
#1	20	561	0	3
Basic	20	561	0	3
Basic	20	561	0	3
Basic Query	10	561	0	3
Basic	20	561	0	3
Basic	30	561	0	3

The performance of query #1 is pretty much what should be expected – no special burden on the engine even without the covering index. Mind you, all 19972 rows reside in 559 pages.

The situation is quite different once we query XML data. The first thing to consider in this case is the increased number of pages: 5136 pages for dbo.AttributesTyped and 9087 pages for dbo.AttributesNameValue (the number of pages has grown 9 and 16 times respectively). The other consideration is the fact that now we also need an appropriate XQuery to access the data encapsulated in the XML. I've constructed 10 queries (5 per each table) using 5 different approaches to querying XML (you can find each query in the footnotes of this article). These are the results of each query's performance:

Performance ^{(Primary Key Only)}
Query	EAV Schema		Query	Typed Schema
Query	CPU	Reads	Query	CPU	Reads
#2	35193	1825	#7	26819	966
#2	34890	1833	#7	30504	966
#2	34811	1823	#7	26138	966
#2	35591	1823	#7	24305	966
#2	35040	1823	#7	26278	966
#2	35632	1823	#7	26869	966
#3	21723	1824	#8	4454	966
#3	20960	1829	#8	4967	966
#3	21121	1823	#8	4356	966
#3	21782	1823	#8	4206	966
#3	22082	1823	#8	4366	966
#3	22672	1823	#8	4376	966
#4	21565	1818	#9	4502	960
#4	21040	1823	#9	4847	960
#4	20950	1817	#9	4547	960
#4	21701	1817	#9	4126	960
#4	22002	1817	#9	4426	960
#4	22131	1817	#9	4566	960
#5	74417	1820	#10	44344	962
#5	72485	1823	#10	45816	962
#5	73916	1819	#10	43923	962
#5	75068	1819	#10	43383	962
#5	75739	1819	#10	43964	962
#5	74878	1819	#10	44635	962
#6	25449	1820	#11	4676	962
#6	24495	1825	#11	4897	962
#6	24685	1819	#11	4587	962
#6	25517	1819	#11	4836	962
#6	26629	1819	#11	4616	962
#6	25918	1819	#11	4446	962

The loss of performance is blindingly obvious – a lot more reads are necessary to access the data, of course the problem remains the same: "return the ID and the LastName of the person with the FirstName 'Karen'".

What's in a query?

Allow me to explain what the functions actually do. There are four of them, two per each table (per XML Type).

One pair of functions uses the nodes XML retrieval method (used in queries #2, #3, #7 and #8);
While the other pair uses the query XML retrieval method (used in queries #5, #6, #10 and #11).

Each function accepts two parameters: the XML and the name of the element, and returns the value of the given element.

Queries #4 and #9 are 'function-free', using only the query XML retrieval method with appropriate XQueries to access the nodes.

What!? No XML indexes!?

Yes, in the first series of tests the only existing indexes were the tables' primary keys. Could that be the reason behind poor performance? Well, let's add those immediately.

Performance ^{(All XML indexes)}
Query	EAV Schema		Query	Typed Schema
Query	CPU	Reads	Query	CPU	Reads
#2	34876	1823	#7	27355	966
#2	35331	1823	#7	27009	966
#2	35100	1823	#7	27189	966
#2	34009	1823	#7	27550	966
#2	35240	1823	#7	27690	966
#2	34700	1823	#7	27339	966
#3	26330	977379	#8	7693	142280
#3	21110	687442	#8	7641	142280
#3	28792	1170670	#8	7491	142280
#3	21581	687442	#8	7771	142280
#3	30103	1170670	#8	7862	142280
#3	30063	1170670	#8	7701	142280
#4	22532	785248	#9	7765	142537
#4	22863	809326	#9	7781	142537
#4	21641	809326	#9	7801	142537
#4	21681	688934	#9	7671	142537
#4	23464	809326	#9	7851	142537
#4	23013	809326	#9	7721	142537
#5	74347	1819	#10	45836	962
#5	76040	1819	#10	45946	962
#5	72023	1819	#10	46046	962
#5	75338	1819	#10	45745	962
#5	74027	1819	#10	45435	962
#5	74307	1819	#10	46006	962
#6	23530	386552	#11	7919	142276
#6	25576	386552	#11	7481	142276
#6	22112	386552	#11	8062	142276
#6	23524	386552	#11	7982	142276
#6	23143	386552	#11	8061	142276
#6	23294	386552	#11	8011	142276

Dreaming of performance improvements? Keep on dreaming. With XML indexes even more reads are necessary to access data, yet – in the case of the EAV type – the CPU overhead is not all that significant compared to the case without XML indexes. The increase in the number of reads is due to the way XML indexing is implemented in SQL Server 2005: additional system tables are created to store the XML index data.

In other words: 419412 additional rows in 2786 pages for dbo.AttributesNameValue and 179748 additional rows in 1105 pages for dbo.AttributesTyped.

The number of reads in queries against the dbo.AttributesNameValue table where XML indexes are actually used (i.e. queries #3, #4 and #6) is 536, 432 and 212 times higher compared to the number of reads when only the table's primary key is used. The good thing is that the impact on the processor remains pretty much unchanged – the queries seem to take just as much time with or without the XML indexes.

There is a significant difference, however, with the queries against the dbo.AttributesTyped table. Here, the impact on the processor is approximately 1.7 times higher, while the number of reads (for queries #8, #9 and #11) is approximately 148 times higher compared to the number of reads when only the table's primary key is used.

Books Online suggest using the VALUE and the PATH XML indexes for these kinds of queries, but as I have found out (using this particular test data) the only XML index ever used was the PRIMARY XML index (at least none of the other indexes ever showed up on the execution plans); although the performance was best with all XML indexes. In fact, I have experienced the most positive impact on performance (the lowest number of reads) after adding the PROPERTY index, which may have something to do with the nillability of the MiddleName element in both XML types, but I'm not certain. Anyway, according to the execution plans, no secondary XML index was ever used by the optimizer.

Definitely a matter worth exploring further...

Weak or strong?

XML Schema Collections bring proper domain integrity to the XML data type, but is that all they bring? Let's add the XML Schemas for the two XML types, and after the XML Schema Collections have been created we need to drop all XML indexes prior to altering the two XML columns – the rule that indexed columns cannot be altered applies to XML columns as well. Before executing the queries re-create all XML indexes. Now we're ready to see whether this affects the performance of the queries in any way.

Performance ^{(All XML indexes)}
Query	EAV Schema		Query	Typed Schema
Query	CPU	Reads	Query	CPU	Reads
#2	43112	9115	#7	30424	5156
#2	41290	9115	#7	30024	5156
#2	41309	9115	#7	30324	5156
#2	41480	9115	#7	30654	5156
#2	46056	9115	#7	30614	5156
#2	45425	9115	#7	30504	5156
#3	14614	511056	#8	7220	145909
#3	16469	511056	#8	7050	145909
#3	13209	511056	#8	7200	145909
#3	13329	511056	#8	7411	145909
#3	14911	511056	#8	7110	145909
#3	15152	511056	#8	7330	145909
#4	13794	512012	#9	7096	146163
#4	13299	511980	#9	7060	146163
#4	13038	511980	#9	7241	146163
#4	13128	511980	#9	7050	146163
#4	15012	511980	#9	7030	146163
#4	14491	512140	#9	7101	146163
#5	77025	9111	#10	45956	5152
#5	74747	9111	#10	45816	5152
#5	74387	9111	#10	46246	5152
#5	74488	9111	#10	46517	5152
#5	82569	9111	#10	45906	5152
#5	78934	9111	#10	45295	5152
#6	16732	309900	#11	7241	145905
#6	16544	309900	#11	7110	145905
#6	16413	309900	#11	7341	145905
#6	16434	309900	#11	7401	145905
#6	17775	309900	#11	7111	145905
#6	16494	309900	#11	7240	145905

It seems that typing the XML enables the SQL Server engine to organize the XML indexes more efficiently, as a consequence the number of logical reads is reduced – that is, for queries where XML indexes are used (i.e. queries #3, #4, #6, #8, #9 and #11). While the number of logical reads is increased for the queries where XML indexes aren't used (i.e. queries #2, #5, #7 and #10).

Actually, both the number of logical reads as well as the stress on the CPU have dropped dramatically for the case of the EAV XML type, while the number of logical reads has increased slightly for the strongly typed XML, but at the same time the stress on the CPU has decreased. Apparently, typing does improve performance of retrieval methods. Again, this is a matter worthy of more detailed investigation.

Mental aggregation

We've compared the performance aspects of a few typical use-cases for the built-in XML retrieval methods. So, what is the pattern that we've experienced? Regarding XML in SQL Server several points have become apparent:

Queries against the XML data type are by far more resource-intensive than queries against data properly normalised in SQL;
XML indexes are very much different and behave quite differently from regular indexes: many more reads are required to access data;
SQL Server 2005 seems to "prefer" typed XML; nonetheless, the performance of queries against the XML data type will diminish as the complexity of the XML data increases;
We've seen proof once again of why functions on data columns in query restrictions should be avoided – with user-defined functions in the WHERE clause the performance was really very poor. However, the overall performance of queries *could* benefit from functions being used in the SELECT clause (as is the case with queries #3 and #6 vs. query #4 or queries #8 and #11 vs. #9), but this should never be taken as a general rule – it merely represents an alternative to consider.

I would have expected XML Queries to perform less efficiently than SQL queries, but judging from my test data the drop in performance can be quite dramatic. To be honest, I also did not expect XML Schema Collections to have a positive impact on retrieval performance; quite the opposite!

In one of the following posts I'll be focusing some more on XML index use; most importantly – I'll be discussing the execution plans which we haven't looked at yet. After all, this was supposed to be only a slightly more detailed introduction into the world of XML retrieval methods introduced with SQL Server 2005. Next up is a look at XPath expressions.

Query #1:

select Person.Contact.ContactID as ContactID
 ,Person.Contact.LastName as LastName
 from Person.Contact
 where (Person.Contact.FirstName = 'Karen')

The covering index for the Person.Contact table:

create nonclustered index x_Person_Contact_FullName
 on Person.Contact
  (
  FirstName
  ,MiddleName
  ,LastName
  )

Query #2:

select dbo.AttributesNameValue.AttributeId as AttributeId
 ,dbo.fnGet_Value_byAttributeName_fromNameValue(
  dbo.AttributesNameValue.Attribute, 'lastName'
  ) as LastName
 from dbo.AttributesNameValue
 where (dbo.fnGet_Value_byAttributeName_fromNameValue(
    dbo.AttributesNameValue.Attribute
    ,'firstName'
    ) = 'Karen')

Query #3:

select ANV.AttributeId as AttributeId
 ,dbo.fnGet_Value_byAttributeName_fromNameValue(
  ANV.Attribute
  ,'lastName'
  ) as LastName
 from dbo.AttributesNameValue ANV
 where (ANV.Attribute.query(
    'attributeCollection/attribute[name = "firstName"]/value'
    ).value(
     '.'
     ,'varchar(max)'
     ) = 'Karen')

Query #4:

select ANV.AttributeId as AttributeId
 ,ANV.Attribute.query(
  'attributeCollection/attribute[name = "lastName"]/value'
   ).value(
    '.'
    ,'varchar(max)'
    ) as LastName
 from dbo.AttributesNameValue ANV
 where (ANV.Attribute.query(
    'attributeCollection/attribute[name = "firstName"]/value'
    ).value(
     '.'
     ,'varchar(max)'
     ) = 'Karen')

Query #5:

select ANV.AttributeId as AttributeId
 ,dbo.fnGet_Value_byAttributeName_fromNameValue_Query(
  ANV.Attribute
  ,'lastName'
  ) as LastName
 from dbo.AttributesNameValue ANV
 where (dbo.fnGet_Value_byAttributeName_fromNameValue_Query(
    ANV.Attribute
    ,'firstName'
    ) = 'Karen')

Query #6:

select ANV.AttributeId as AttributeId
 ,dbo.fnGet_Value_byAttributeName_fromNameValue_Query(
  ANV.Attribute
  ,'lastName'
  ) as LastName
 from dbo.AttributesNameValue ANV
 where (ANV.Attribute.query(
    '/attributeCollection/attribute[name = "firstName"][1]/value'
    ).value(
     '.'
     ,'varchar(max)'
     ) = 'Karen')

Query #7:

select dbo.AttributesTyped.AttributeId as AttributeId
 ,dbo.fnGet_Value_byAttributeName_fromTyped(
  dbo.AttributesTyped.Attribute
  ,'lastName'
  ) as LastName
 from dbo.AttributesTyped
 where (dbo.fnGet_Value_byAttributeName_fromTyped(
    dbo.AttributesTyped.Attribute
    ,'firstName'
    ) = 'Karen')

Query #8:

select AT.AttributeId as AttributeId
 ,dbo.fnGet_Value_byAttributeName_fromTyped(
  AT.Attribute
  ,'lastName'
  ) as LastName
 from dbo.AttributesTyped AT
 where (AT.Attribute.query(
    'attributeCollection/firstName'
    ).value(
     '.'
     ,'varchar(max)'
     ) = 'Karen')

Query #9:

select AT.AttributeId as AttributeId
 ,AT.Attribute.query(
  'attributeCollection/lastName'
  ).value(
   '.'
   ,'varchar(max)'
   ) as LastName
 from dbo.AttributesTyped AT
 where (AT.Attribute.query(
    'attributeCollection/firstName'
    ).value(
     '.'
     ,'varchar(max)'
     ) = 'Karen')

Query #10:

select AT.AttributeId as AttributeId
 ,dbo.fnGet_Value_byAttributeName_fromTyped_Query(
  AT.Attribute
  ,'lastName'
  ) as LastName
 from dbo.AttributesTyped AT
 where (dbo.fnGet_Value_byAttributeName_fromTyped_Query(
    AT.Attribute
    ,'firstName'
    ) = 'Karen')

Query #11:

select AT.AttributeId as AttributeId
 ,dbo.fnGet_Value_byAttributeName_fromTyped_Query(
  AT.Attribute
  ,'lastName'
  ) as LastName
 from dbo.AttributesTyped AT
 where (AT.Attribute.query(
    'attributeCollection/firstName'
    ).value(
     '.'
     ,'varchar(max)'
     ) = 'Karen')

4 comments:

Guennadi Vanine said...: I am surprised that here is no feedback.

Does it mean that nobody uses XML in MS SQL Server?

Would you be interested to look into this discussion?
http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=1757316&SiteID=1&mode=1

Guennadi Vanine - Геннадий Ванин; 20 June, 2007 04:21
Matija Lah said...: AFAIK XML is being used, even in open-schema solutions: with a data core common to all cases properly normalized in the relational world, leaving the XML to handle marginal cases (i.e. indigenous to each individual case).

Case-sensitivity or better: the lack of XQuery/XPath support in the database engine is probably the biggest problem with XML in SQL at this time.

Then again - both standards have now become official W3C recommendations, which means that we should expect their full implementation in a next version of SQL Server.

ML; 20 June, 2007 10:10
Anonymous said...: I know this is a bit late, so hopefully, someone is still listening.

I have been struggling with performance issues in queries against XML data in SQL Server 2005 for some time now. My situation involves a table with a handful of relational columnns plus a single XML column. The xml column is simple xml, but not types, because the elements of the xml can change quite dramatically. My xml looks something like this...

<data>
<first_name>John</first_name>
<last_name>John</last_name>
<street>John</street>
<city>John</city>
ETC...
</data>

But, with many more elements, sometimes upwards of 350.

The queries I am attempting to run look like this...

SELECT
ID,
SomeOtherCol
FROM MyTable
WHERE
MyXmlCol.value('(/data/last_name/text())[1]', 'nvarchar(100)') like '%Doe%'
ORDER BY
MyXmlCol.value('(/data/first_name/text())[1]', 'nvarchar(100)')

Any time I use the xml col in the where clause or the order by clause, my queries become very slow and almost unusable in my scenarios.

I have tried all combination of indexes on the xml col, but no improvement is seen.

I had heard a rumor online (1 post) that Sql Server 2005 had a bug with indexes on xml columns that resulted in slower performance than by not using the indexes. Could not find any other information on this.

Anyway, I thought that because my queries into the xml used a simple xpath to get to an element (which by the way would be unique in the xml col), I would be able to get better performance.

If anyone has any thoughts or guidance, that would be helpful!

Kevin; 10 January, 2009 04:33
Matija Lah said...: Kevin,

I see you've also posted this question in the microsoft.public.sqlserver.programming news group.

Let's continue the discussion there. Ok?

ML; 13 January, 2009 13:38

Tuesday, January 09, 2007

SQL Server 2005 XML Methods, Part Two, Retrieval Comparison