Wednesday, January 24, 2007

XQuery 1.0 becomes a W3C Recommendation

On January 23rd 2007 the World Wide Web Consortium announced that XQuery 1.0, XPath 2.0 and XSLT 2.0 are now official W3C Recommendations, meaning that the R&D activities regarding the current versions have come to a conclusion and that all three standards are now ready for implementation. On the other hand this also means that the work on the next versions has started – I believe the XML Query Working Group is now counting on feedback from implementors in order to improve the standards.

I've posted on this subject before, when the Proposed Recommendations were announced.

The specifications are available at the following locations:

Additional documents are available at the following locations:

As far as the XQuery/XPath functions are concerned the collection remains unchanged.

The R&D activities concerning XQuery 1.0 and XPath 2.0 Full-Text language extensions are also well under way.

I guess we should now expect some changes to the SQL Server implementation of the standards, perhaps not in a service release but certainly in the next full version of the platform.


ML

Tuesday, January 09, 2007

SQL Server 2005 XML Methods, Part Two, Retrieval Comparison

After a brief introduction to XML retrieval methods in SQL Server 2005 we can examine the behaviour of a few practical examples.

The implementation of XML as a native data type could not have been considered complete without XML indexes. Especially when considering data retrieval – indexing may just as well be the deciding element of the entire implementation. There are four types of XML indexes available: a primary XML index and three secondary XML indexes each of the latter with a specific intended purpose. Please, follow the link above (to Books Online) for all the details.

Originally the subject of this post was supposed to be how XML indexes improve the performance of XML retrieval methods. I say *originally* because after a lot of testing and playing around with different possibilities and different approaches I've come to the conclusion that if XML indexes should have had a positive impact on performance then there must have been something seriously wrong with my sample data, or perhaps I have yet to find the magic combination. Either way, I don't believe finding a solution to a problem as trivial as this really should be this difficult. Well, I'm not ready to give up. Yet.


Roll up the sleeves

First of all, execute this script in your favourite testing database. The script creates two tables and fills them up with XML data, assuming that the AdventureWorks database is accessible from your favourite testing database. Then, to spice things up, add these four functions. Spicy and XPathy.

I assume you always read through each script before executing it. ;)


What's in a table?

When executed the script above will build two tables (dbo.AttributesNameValue and dbo.AttributesTyped) and insert two types of XML into them from the Person.Contact table of the AdventureWorks database (i.e. customer attributes such as FirstName, MiddleName and LastName are encapsulated into a single XML per CustomerID).

In the script you'll notice the @step variable; its purpose is to limit the number of rows retrieved from the source table. I've set it to 100,000 which exceeds the actual number of rows, and you can reduce it appropriately – to play around with the size of the test table. I would be very interested in knowing whether anyone comes up with different results.

The difference between the tables is that the XML in the dbo.AttributesNameValue is based on an untyped (or at least very loosely typed) EAV model and the one in the dbo.AttributesTyped is typed. Here are the XML Schemas used:

  • One for the EAV XML stored in the dbo.AttributesNameValue.Attribute table; and
  • Another for the Typed XML stored in the dbo.AttributesTyped.Attribute table.

I'll start by being mean and ruthless, simply to illustrate what we're up against. The following table shows the performance of the original query against the Person.Contact table:

Query Performance
Primary Key Only Covering Index
CPU Reads CPU Reads
#1 20 561 0 3

The performance of query #1 is pretty much what should be expected – no special burden on the engine even without the covering index. Mind you, all 19972 rows reside in 559 pages.

The situation is quite different once we query XML data. The first thing to consider in this case is the increased number of pages: 5136 pages for dbo.AttributesTyped and 9087 pages for dbo.AttributesNameValue (the number of pages has grown 9 and 16 times respectively). The other consideration is the fact that now we also need an appropriate XQuery to access the data encapsulated in the XML. I've constructed 10 queries (5 per each table) using 5 different approaches to querying XML (you can find each query in the footnotes of this article). These are the results of each query's performance:

Performance
(Primary Key Only)
Query EAV Schema Query Typed Schema
CPU Reads CPU Reads
#2 35193 1825 #7 26819 966
#3 21723 1824 #8 4454 966
#4 21565 1818 #9 4502 960
#5 74417 1820 #10 44344 962
#6 25449 1820 #11 4676 962

The loss of performance is blindingly obvious – a lot more reads are necessary to access the data, of course the problem remains the same: "return the ID and the LastName of the person with the FirstName 'Karen'".


What's in a query?

Allow me to explain what the functions actually do. There are four of them, two per each table (per XML Type).

  • One pair of functions uses the nodes XML retrieval method (used in queries #2, #3, #7 and #8);
  • While the other pair uses the query XML retrieval method (used in queries #5, #6, #10 and #11).

Each function accepts two parameters: the XML and the name of the element, and returns the value of the given element.

Queries #4 and #9 are 'function-free', using only the query XML retrieval method with appropriate XQueries to access the nodes.


What!? No XML indexes!?

Yes, in the first series of tests the only existing indexes were the tables' primary keys. Could that be the reason behind poor performance? Well, let's add those immediately.

Performance
(All XML indexes)
Query EAV Schema Query Typed Schema
CPU Reads CPU Reads
#2 34876 1823 #7 27355 966
#3 26330 977379 #8 7693 142280
#4 22532 785248 #9 7765 142537
#5 74347 1819 #10 45836 962
#6 23530 386552 #11 7919 142276

Dreaming of performance improvements? Keep on dreaming. With XML indexes even more reads are necessary to access data, yet – in the case of the EAV type – the CPU overhead is not all that significant compared to the case without XML indexes. The increase in the number of reads is due to the way XML indexing is implemented in SQL Server 2005: additional system tables are created to store the XML index data.

In other words: 419412 additional rows in 2786 pages for dbo.AttributesNameValue and 179748 additional rows in 1105 pages for dbo.AttributesTyped.

The number of reads in queries against the dbo.AttributesNameValue table where XML indexes are actually used (i.e. queries #3, #4 and #6) is 536, 432 and 212 times higher compared to the number of reads when only the table's primary key is used. The good thing is that the impact on the processor remains pretty much unchanged – the queries seem to take just as much time with or without the XML indexes.

There is a significant difference, however, with the queries against the dbo.AttributesTyped table. Here, the impact on the processor is approximately 1.7 times higher, while the number of reads (for queries #8, #9 and #11) is approximately 148 times higher compared to the number of reads when only the table's primary key is used.

Books Online suggest using the VALUE and the PATH XML indexes for these kinds of queries, but as I have found out (using this particular test data) the only XML index ever used was the PRIMARY XML index (at least none of the other indexes ever showed up on the execution plans); although the performance was best with all XML indexes. In fact, I have experienced the most positive impact on performance (the lowest number of reads) after adding the PROPERTY index, which may have something to do with the nillability of the MiddleName element in both XML types, but I'm not certain. Anyway, according to the execution plans, no secondary XML index was ever used by the optimizer.

Definitely a matter worth exploring further...


Weak or strong?

XML Schema Collections bring proper domain integrity to the XML data type, but is that all they bring? Let's add the XML Schemas for the two XML types, and after the XML Schema Collections have been created we need to drop all XML indexes prior to altering the two XML columns – the rule that indexed columns cannot be altered applies to XML columns as well. Before executing the queries re-create all XML indexes. Now we're ready to see whether this affects the performance of the queries in any way.

Performance
(All XML indexes)
Query EAV Schema Query Typed Schema
CPU Reads CPU Reads
#2 43112 9115 #7 30424 5156
#3 14614 511056 #8 7220 145909
#4 13794 512012 #9 7096 146163
#5 77025 9111 #10 45956 5152
#6 16732 309900 #11 7241 145905

It seems that typing the XML enables the SQL Server engine to organize the XML indexes more efficiently, as a consequence the number of logical reads is reduced – that is, for queries where XML indexes are used (i.e. queries #3, #4, #6, #8, #9 and #11). While the number of logical reads is increased for the queries where XML indexes aren't used (i.e. queries #2, #5, #7 and #10).

Actually, both the number of logical reads as well as the stress on the CPU have dropped dramatically for the case of the EAV XML type, while the number of logical reads has increased slightly for the strongly typed XML, but at the same time the stress on the CPU has decreased. Apparently, typing does improve performance of retrieval methods. Again, this is a matter worthy of more detailed investigation.


Mental aggregation

We've compared the performance aspects of a few typical use-cases for the built-in XML retrieval methods. So, what is the pattern that we've experienced? Regarding XML in SQL Server several points have become apparent:

  1. Queries against the XML data type are by far more resource-intensive than queries against data properly normalised in SQL;
  2. XML indexes are very much different and behave quite differently from regular indexes: many more reads are required to access data;
  3. SQL Server 2005 seems to "prefer" typed XML; nonetheless, the performance of queries against the XML data type will diminish as the complexity of the XML data increases;
  4. We've seen proof once again of why functions on data columns in query restrictions should be avoided – with user-defined functions in the WHERE clause the performance was really very poor. However, the overall performance of queries *could* benefit from functions being used in the SELECT clause (as is the case with queries #3 and #6 vs. query #4 or queries #8 and #11 vs. #9), but this should never be taken as a general rule – it merely represents an alternative to consider.

I would have expected XML Queries to perform less efficiently than SQL queries, but judging from my test data the drop in performance can be quite dramatic. To be honest, I also did not expect XML Schema Collections to have a positive impact on retrieval performance; quite the opposite!

In one of the following posts I'll be focusing some more on XML index use; most importantly – I'll be discussing the execution plans which we haven't looked at yet. After all, this was supposed to be only a slightly more detailed introduction into the world of XML retrieval methods introduced with SQL Server 2005. Next up is a look at XPath expressions.


ML


  • Query #1:
    select Person.Contact.ContactID as ContactID
     ,Person.Contact.LastName as LastName
     from Person.Contact
     where (Person.Contact.FirstName = 'Karen')
  • The covering index for the Person.Contact table:
    create nonclustered index x_Person_Contact_FullName
     on Person.Contact
      (
      FirstName
      ,MiddleName
      ,LastName
      )
  • Query #2:
    select dbo.AttributesNameValue.AttributeId as AttributeId
     ,dbo.fnGet_Value_byAttributeName_fromNameValue(
      dbo.AttributesNameValue.Attribute, 'lastName'
      ) as LastName
     from dbo.AttributesNameValue
     where (dbo.fnGet_Value_byAttributeName_fromNameValue(
        dbo.AttributesNameValue.Attribute
        ,'firstName'
        ) = 'Karen')
  • Query #3:
    select ANV.AttributeId as AttributeId
     ,dbo.fnGet_Value_byAttributeName_fromNameValue(
      ANV.Attribute
      ,'lastName'
      ) as LastName
     from dbo.AttributesNameValue ANV
     where (ANV.Attribute.query(
        'attributeCollection/attribute[name = "firstName"]/value'
        ).value(
         '.'
         ,'varchar(max)'
         ) = 'Karen')
  • Query #4:
    select ANV.AttributeId as AttributeId
     ,ANV.Attribute.query(
      'attributeCollection/attribute[name = "lastName"]/value'
       ).value(
        '.'
        ,'varchar(max)'
        ) as LastName
     from dbo.AttributesNameValue ANV
     where (ANV.Attribute.query(
        'attributeCollection/attribute[name = "firstName"]/value'
        ).value(
         '.'
         ,'varchar(max)'
         ) = 'Karen')
  • Query #5:
    select ANV.AttributeId as AttributeId
     ,dbo.fnGet_Value_byAttributeName_fromNameValue_Query(
      ANV.Attribute
      ,'lastName'
      ) as LastName
     from dbo.AttributesNameValue ANV
     where (dbo.fnGet_Value_byAttributeName_fromNameValue_Query(
        ANV.Attribute
        ,'firstName'
        ) = 'Karen')
  • Query #6:
    select ANV.AttributeId as AttributeId
     ,dbo.fnGet_Value_byAttributeName_fromNameValue_Query(
      ANV.Attribute
      ,'lastName'
      ) as LastName
     from dbo.AttributesNameValue ANV
     where (ANV.Attribute.query(
        '/attributeCollection/attribute[name = "firstName"][1]/value'
        ).value(
         '.'
         ,'varchar(max)'
         ) = 'Karen')
  • Query #7:
    select dbo.AttributesTyped.AttributeId as AttributeId
     ,dbo.fnGet_Value_byAttributeName_fromTyped(
      dbo.AttributesTyped.Attribute
      ,'lastName'
      ) as LastName
     from dbo.AttributesTyped
     where (dbo.fnGet_Value_byAttributeName_fromTyped(
        dbo.AttributesTyped.Attribute
        ,'firstName'
        ) = 'Karen')
  • Query #8:
    select AT.AttributeId as AttributeId
     ,dbo.fnGet_Value_byAttributeName_fromTyped(
      AT.Attribute
      ,'lastName'
      ) as LastName
     from dbo.AttributesTyped AT
     where (AT.Attribute.query(
        'attributeCollection/firstName'
        ).value(
         '.'
         ,'varchar(max)'
         ) = 'Karen')
  • Query #9:
    select AT.AttributeId as AttributeId
     ,AT.Attribute.query(
      'attributeCollection/lastName'
      ).value(
       '.'
       ,'varchar(max)'
       ) as LastName
     from dbo.AttributesTyped AT
     where (AT.Attribute.query(
        'attributeCollection/firstName'
        ).value(
         '.'
         ,'varchar(max)'
         ) = 'Karen')
  • Query #10:
    select AT.AttributeId as AttributeId
     ,dbo.fnGet_Value_byAttributeName_fromTyped_Query(
      AT.Attribute
      ,'lastName'
      ) as LastName
     from dbo.AttributesTyped AT
     where (dbo.fnGet_Value_byAttributeName_fromTyped_Query(
        AT.Attribute
        ,'firstName'
        ) = 'Karen')
  • Query #11:
    select AT.AttributeId as AttributeId
     ,dbo.fnGet_Value_byAttributeName_fromTyped_Query(
      AT.Attribute
      ,'lastName'
      ) as LastName
     from dbo.AttributesTyped AT
     where (AT.Attribute.query(
        'attributeCollection/firstName'
        ).value(
         '.'
         ,'varchar(max)'
         ) = 'Karen')