In this uncertain world of ours one thing remains pretty certain.
2008 is definitely coming. The year, that is. ;)
Have a good one!
Ever since the problem with the missing iFilters for Office 2007 documents first arose, the official response from the vendor has been to install Office 2007 (or at least Word 2007) on the server. I believe we all agree that this simply isn't good practice.
First of all, why would anyone need an end-user application such as Microsoft Word on a server?
Obviously Word would never even have to be started on the server (unless the DBA would use it for documenting ;), but the question of how it would compete with SQL Server services still needs to be considered if we're planning on putting it there. Ok, at least it deserves a thought.
Then there's the question of security. Does MS Word 2007 with all of its modern data-sharing-collaborating-globally functionalities provide enough security to even be "allowed" to share the machine with SQL Server in a production environment? I'm not saying Word is IIS, I'm simply saying that the security aspect needs proper consideration. Remember all those Office-related security issues from previous versions?
Frankly, I see no reason for any end-user application (together with several server applications) to exist on the same machine with SQL Server. Another battle for resources is something any DBA should prevent. Ok, maybe we can have SSMS on the server, but that one's really more an integral part of the package than just another application. ;)
Anyway, all these questions have since become irrelevant, as Microsoft just made the iFilters available as a stand-alone download.
The files can be found here:
...and the corresponding KB article can be found here:
Now go, and add support for Microsoft Office 2007 documents to your SQL Server 2005's Full-text Search.
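Installing the filter pack alone is not quite the whole story: the instance still needs to pick up the newly registered filters. A minimal sketch of the usual registration steps (check the sp_fulltext_service topic in Books Online for your build before running this):

```sql
-- load the word breakers, stemmers and iFilters registered with the operating system
exec sp_fulltext_service 'load_os_resources', 1

-- optionally, skip signature verification of the filter binaries
exec sp_fulltext_service 'verify_signature', 0
```

Afterwards, restart the full-text service (MSFTESQL) and repopulate the affected full-text catalogs.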
Once again it's time to update the local copy of SQL Server 2005 Books Online.
It's the end of November and the September 2007 Update has just been made available for download. The files are available through Windows Update, but in case you've missed them (or want full control over the installation) you can also find them here:
The latest version of the
AdventureWorks database is available on CodePlex:
I've been using SQL Server Analysis Services for large data analysis (analyzing texts) recently, and I've encountered an intriguing problem, the investigation of which has led me to some unexpected discoveries (and a workaround).
To reproduce the problem start SQL Server Business Intelligence Studio, then follow these 6 simple steps:
select
 sys.objects.[name] as ObjectName
 ,sys.objects.[object_id] as ObjectId
 ,object_definition(sys.objects.[object_id]) as ObjectDefinition
from sys.objects
where (sys.objects.[type] not in ('F', 'IT', 'PK', 'S', 'SQ', 'U', 'UQ'))

This query returns the definitions of certain database objects; its purpose in this case is to provide an OLAP dimension using the nvarchar(max) data type for its members.
I should have said *attempt* to deploy the project. ;) Yes, the deployment fails with the following error message:
Errors in the back-end database access module. The size specified for a binding was too small, resulting in one or more column values being truncated.
Upon further investigation (i.e. by processing each object individually) we can see that our »Object Definitions« dimension is the source of the problem. It would be quite reasonable to expect this error if we had altered the data source view or the underlying objects after the dimensions had been created, and if this modification had resulted in an increase of the size of any of the data source view columns, as is the case in this issue reported on Microsoft Connect:
The problem reported by Pieter M. does not apply to our case, though, as we've only just created the objects. Let's look at the dimension's definition (right-click the »Object Definitions« dimension in Solution Explorer and select »View Code«) to continue our investigation. Look for the Attribute named »Object Definition« in the XML (
Dimension/Attributes/Attribute/Name). Under the
KeyColumn element you can see the properties of the Object Definition column.
<Attribute dwd:design-time-name="28921594-bbde-434a-b855-8b76cccf1e8a">
 <ID>Object Definition</ID>
 <Name>Object Definition</Name>
 <EstimatedCount>191</EstimatedCount>
 <KeyColumns>
  <KeyColumn dwd:design-time-name="0377ce84-9c52-40ac-8149-64f2cf826009">
   <DataType>WChar</DataType>
   <Source xsi:type="ColumnBinding" dwd:design-time-name="2509b813-0e0f-40e5-aa48-317a60cfe881">
    <TableID>ObjectDefinitions</TableID>
    <ColumnID>ObjectDefinition</ColumnID>
   </Source>
  </KeyColumn>
 </KeyColumns>
 <OrderBy>Key</OrderBy>
 <InstanceSelection>List</InstanceSelection>
</Attribute>
From the DataType element we can see that the data type is correctly mapped (i.e. WChar); what we cannot see is the DataSize element. But, hold on, do we really need one? After all, nvarchar(max) is supposed to be unlimited, isn't it? Well, it is limited to 2 GBytes (2,147,483,647 Bytes, to be exact).
How large is the data in the Object Definition column, in fact? A quick check in SQL Server Management Studio...
select
 max(datalength(object_definition(sys.objects.[object_id]))) as MaxSize
from sys.objects
where (sys.objects.[type] not in ('F', 'IT', 'PK', 'S', 'SQ', 'U', 'UQ'))
...reveals that the largest value is "only" 12,844 Bytes. Ok, let's take on the challenge, and add a DataSize element to the dimension's definition, as a sibling following the DataType element (as documented in Books Online). If we now process the dimension, the processing finishes successfully, and we can deploy the project. There it is!
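For reference, the amended KeyColumn element would then look roughly like this (the DataSize value of 12844 matches the largest value measured above; any size that accommodates your data will do):

```xml
<KeyColumn dwd:design-time-name="0377ce84-9c52-40ac-8149-64f2cf826009">
 <DataType>WChar</DataType>
 <DataSize>12844</DataSize>
 <Source xsi:type="ColumnBinding" dwd:design-time-name="2509b813-0e0f-40e5-aa48-317a60cfe881">
  <TableID>ObjectDefinitions</TableID>
  <ColumnID>ObjectDefinition</ColumnID>
 </Source>
</KeyColumn>
```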
So, if we can predict the size of our
[n]varchar data, we're all set. But what if we can't?
BTW, if you're thinking of explicitly setting the size of
[n]varchar data source view columns, bear in mind that even this won't work every time. For instance: casting the values to
nvarchar(4000) results in the value for the
DataSize element being set to 1,024 Bytes. How weird is that!?
I must admit I'm somewhat baffled by how the mapping between Database Engine and Analysis Services data types is implemented. After all, both the
nvarchar(max) and the
WChar data types are based on the
String data type (
System.String) of the .Net framework, which means that the upper limit of their sizes is supposed to be 2 GBytes. Why would the actual size then still need to be specified? And why isn't it then specified automatically or – dare I say it – assumed? And how can 1024 be equal to 4000?
I have tried specifying the maximum value of 2147483647 – since one can't always predict the size of
nvarchar(max) data, don't you agree? Isn't that what the *MAX* is for? Anyway, I was greeted with this wonderful error message:
Memory error: Allocation failure : Not enough storage is available to process this command.
To respond to this particular exception the documentation suggests closing programs, reducing paging, adding more RAM etc., which all seem pretty absurd given the situation, but – as you probably know – it's just a generic operating system exception.
So, if we can't specify the said maximum value, what can we do? The most logical solution would be to analyze existing data and perhaps come up with an approximation of how large the data is going to grow in the future. Which is a fine strategy, until it breaks. Well, I've been playing around a bit with this option and have found what seems to be the actual upper limit for the
WChar data type. And it's a weird one, to say the least. Does the size of 163,315,555 Bytes mean anything to you?
Yes, here it is (for you to copy and paste to your project)...

<DataSize>163315555</DataSize>

...the actual limit that still allows the deployment (at least the deployment of this particular SSAS project).
WTF? If you have any clue, please, leave a comment. And if you can deploy the project using a larger value, please, provide a repro. I really would like to know.
So, why all the fuss? Who needs to analyze large pieces of text? Well, I do. :) Honestly, I've come across this little column truncation "inconvenience" merely by chance. I don't think the texts that I need to run analyses against would ever exceed a thousand Bytes in size but one never knows... And now I know they should never exceed 156 MBytes. :)
p.s. I've reported this issue on Connect as well, so, please, vote on it if you think it should be resolved.
SSAS: varchar/nvarchar column size incorrectly mapped between data source views and cube dimensions (#298085)
Take a look at the picture below.
Somewhere in that lovely mess, on one shred of paper – or more? – there is an eleven digit number that I need.
Luckily for me, I've managed to find all the pieces. And it only took about 20 minutes. :) Which is still a long time compared to about three seconds that it originally took to cause the problem...
»Become a better developer in six months.« Define 'better'. :)
(Self-)education is the first thing that comes to mind, and the majority of the tag-attack "victims" agree that reading and learning (and writing about it) directly lead to being better at development; myself included. However, the kind of literature that I consider important (in view of my current work) may have nothing or at least very little to do with software or database development. As some of you know, my principal calling is law and legal science, so if I plan on being better at SQL development at the moment then I'll have to brush up on these:
But the key (IMHO) to improving the professional skill-set lies elsewhere. »Staying hungry« is an old principle that I find highly beneficial to self-development.
What am I talking about? Allow me to illustrate. Every boxing coach will work hard at keeping "the champ" fit to fight and at or near his peak. But a good coach will also keep the boxer from reaching the peak of his skills *before* the big match – for instance: stopping a practice fight before time, changing sparring partners to distract the boxer, or doing other more or less annoying things. In other words: keeping the champ "hungry" as well as fit.
Law? Hunger? Boxing? What do these have to do with development and being better at it?
All I'm really trying to say is that spending time away from the workplace, away from work, and – most importantly – away from the PC screen actually provides a distraction that might seem insignificant or even counter-productive but is in effect a great way of getting the creative juices flowing. I'm not saying the brain should be switched off – not at all! Keep your brain active, but do something different. And NO! Watching TV simply won't do.
I see Adam is thinking along the same lines.
I've mentioned before (I believe it was in one of the newsgroups) that I keep a guitar next to my desk. What do *you* do while waiting for SSMS or SSIS to finish, especially with longer operations?
I'm not going to tag anyone (I tend not to be a menace ;), but I'd still be interested in what you think or what you'd do to become better at whatever you do professionally.
We've seen SQL Server 2005 XML data retrieval methods in previous posts. In particular we've seen how they perform compared to T-SQL retrieval methods, and – quite frankly – the results were somewhat disappointing.
There is an XML retrieval method that we haven't discussed yet. There are two reasons why not:
I'm of course referring to the
exist() XML method. Its purpose is to test whether a given XQuery expression against a given XML instance returns a node set or not.
The method accepts an XQuery expression as its sole parameter, and returns a scalar value: 1 if the expression returns a non-empty node set, 0 if it returns an empty one, and NULL if the XML instance itself is NULL.
You can find all the details in Books Online. Incidentally, have you tried the new SQL Server 2005 Books Online Scoped Search yet?
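Before we move on, a quick, self-contained illustration of the return values (runnable in any database):

```sql
declare @x xml
set @x = N'<document><title>Hamsters</title></document>'

select @x.exist('/document/title') as TitleExists     -- 1: the node set is not empty
select @x.exist('/document/summary') as SummaryExists -- 0: no such node

set @x = null
select @x.exist('/document/title') as NullInstance    -- NULL: the XML instance itself is NULL
```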
To provide a better understanding of what our example is aimed at here's the original query:
select
 Person.Contact.ContactID as ContactID
from Person.Contact
where (Person.Contact.MiddleName = @middleName)
Our objective is to retrieve a list of
ContactID values for contacts in the
Person.Contact table of the AdventureWorks database with »Z.« as their MiddleName.
One row corresponds to the given criteria (that's 1 row out of 19972 in 559 pages).
First of all, let's compare the performance of the query against the original table with only the clustered primary key and then with a covering index. The definition of the covering index is available in the footnotes of this post (just follow the links from the table).
(Performance counters: Primary Key Only vs. Covering Index.)
Excellent. On both counts. Would anyone expect anything else on a table of this size?
Let's attempt to achieve our objective where the data actually resides. In Part Two of the series we've used this script to create two tables of customer data from the AdventureWorks database, with each customer's first, middle and last names packed up in two different XML instances. Two tables will thus be created:
We'll be using two pairs of queries (one pair for each XML type): a pair where only the value of the sought node is parameterized, and another pair of queries with both the value as well as the name of the target node supplied through parameters. Once again, the queries are located at the bottom of this post. Oh, and you'll need this, too:
declare @elementName sysname
declare @middleName varchar(8)

set @elementName = 'middleName'
set @middleName = 'Z.'
Now, witness the power of exist():
(Performance counters: All XML Indexes.)
Not bad at all! Mind you, we've supplied the sought value in the correct (upper) case – »z.« (lower-case) would not have returned anything.
Judging by the number of reads, data access is minimal – processing can stop as soon as the condition is met, and no data actually needs to be retrieved from the XML. With appropriate XML indexes the XML instances do not have to be accessed at all.
Observe how XML indexes have been used in this case by looking at the execution plans:
NB: I use the following prefixes for XML index names:
pxx_* – primary XML index;
axx_* – secondary XML index for path;
vxx_* – secondary XML index for value; and
rxx_* – secondary XML index for property.
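For completeness, this is roughly how such a set of XML indexes would be created on the dbo.AttributesTyped table from the script mentioned earlier (the index names are illustrative, following the prefixes above; a primary XML index is a prerequisite for any of the secondary ones):

```sql
create primary xml index pxx_AttributesTyped_Attribute
on dbo.AttributesTyped (Attribute)

create xml index axx_AttributesTyped_Attribute
on dbo.AttributesTyped (Attribute)
using xml index pxx_AttributesTyped_Attribute for path

create xml index vxx_AttributesTyped_Attribute
on dbo.AttributesTyped (Attribute)
using xml index pxx_AttributesTyped_Attribute for value

create xml index rxx_AttributesTyped_Attribute
on dbo.AttributesTyped (Attribute)
using xml index pxx_AttributesTyped_Attribute for property
```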
The exist method operates in the XML world, which means that the value supplied to the XQuery expression must be in the correct case. Supplying node names should not present a problem in this regard because we can rely on the given XML model (i.e. using XML Schema is highly recommended – we do want domain integrity, don't we?). This, however, most probably won't be enough to help us supply the data in the correct case. So, what alternatives do we have if we decide on using XML to store data in our SQL Server 2005 database?
1. Restrict the use of the exist method to tests where the case of the data is either unambiguous or irrelevant (e.g. only for numeric values);
2. normalize the case within the XQuery expression itself, using the appropriate XPath functions;
3. keep an additional, lower-case copy of each XML instance to run the tests against; or
4. use the value XML method to extract data and test the conditions in T-SQL.
Let's forget about option 1 for a moment as it is as ridiculous as it sounds. Option 2 is not even possible at this time, since none of the appropriate XPath functions (
translate()) are currently supported by SQL Server 2005. This leaves us with options 3 and 4 – both come at a price.
Keeping a lower-case copy of each XML instance as suggested by option 3 will inevitably increase the size of the data store, so we must ask ourselves whether the increased performance of searching XML instances using the
exist XML method is worth the growth of the database. And there's also the question of maintenance: how will having to keep two copies of the same data affect inserts, updates and deletes (both the logic behind data modifications as well as the performance)?
The answers to both questions depend on the actual case, and may even in the near future become irrelevant if we consider the probability that the XPath/XQuery W3C Recommendations will be implemented in a future release of SQL Server (or be introduced through a service release).
Of course, we can always bring search out of XML and into the relational world, as suggested by option 4. Compare the following counters (and queries) with those presented earlier:
(Performance counters: All XML Indexes.)
"Isn't that nice..." he mumbled to himself sarcastically. Case-insensitivity doesn't come cheap.
Care to compare execution plans?
Pay special attention to the size of data in each step (also signified by the weight of the arrows). Heavy stuff.
The exist XML method has merit – heck, it even outperforms pure T-SQL (in the absence of appropriate indexes). Nevertheless, the nature of XML, namely its case-sensitivity, quite significantly limits the usability of this otherwise highly efficient method.
And – I must say this, because I always do – SQL Server seems to prefer Typed XML (compared to the EAV XML model).
The covering index on the Person.Contact table of the AdventureWorks database:
create nonclustered index x_Person_Contact_MiddleName
on Person.Contact (MiddleName)
select
 ANV.AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.exist(
  '/attributeCollection/attribute[data(name) = "middleName"]
   /value[data(.) = sql:variable("@middleName")]'
  ) = 1)

select
 ANV.AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.exist(
  '/attributeCollection/attribute[data(name) = sql:variable("@elementName")]
   /value[data(.) = sql:variable("@middleName")]'
  ) = 1)

select
 AT.AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.exist(
  '/attributeCollection/middleName[data(.) = sql:variable("@middleName")]'
  ) = 1)

select
 AT.AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.exist(
  '/attributeCollection/*[local-name() = sql:variable("@elementName")]
   [data(.) = sql:variable("@middleName")]'
  ) = 1)

select
 ANV.AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query(
  '/attributeCollection/attribute[data(name) = "middleName"]/value'
  ).value('.', 'varchar(32)') = @middleName)

select
 ANV.AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query(
  '/attributeCollection/attribute[data(name) = sql:variable("@elementName")]/value'
  ).value('.', 'varchar(32)') = @middleName)

select
 AT.AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.query(
  '/attributeCollection/middleName'
  ).value('.', 'varchar(32)') = @middleName)

select
 AT.AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.query(
  '/attributeCollection/*[local-name() = sql:variable("@elementName")]'
  ).value('.', 'varchar(32)') = @middleName)
SQL Server 2005 Best Practices Analyzer was released on July 2nd, and is available at this site:
I am extremely pleased to tell you, dear reader(s), that on July 1st I was presented with the 2007 Microsoft MVP award in the »Windows Server System – SQL Server« category.
I would like to take this opportunity to thank everybody at the MVP Program for considering my participation in the communities to be worthy of this award.
I would certainly like to thank Dejan Sarka for introducing me to the depths of this ocean called SQL.
And of course I thank you, the readers of this blog and the members of the communities, for keeping me up late at night – learning. ;)
In case you've missed it, the May 2007 update of SQL Server 2005 Books Online was made available for download on June 12th.
The 135 MB setup file can be downloaded from this site:
Of course, the online edition has also been updated accordingly.
Also, Windows Live Search now supports SQL Server 2005 Books Online Scoped Search, introducing a much needed way of focusing on relevant subject matter, rather than simply relying on users to provide (or not) additional criteria to the search string (e.g. "SQL Server", "SQL 2005", etc.). Compare scoped search with the 'general' MSDN search.
The PowerPoint slideshow of the presentation I gave at this year's NT Conference is ready for download. The presentation is in Slovene, but I'm willing to translate it into English (or German or Latin – if you give me some time ;) – leave a comment on this blog if you're interested.
The presentation provides an overview of cryptographic capabilities of SQL Server 2005, focusing on the Database Engine. I also provide a brief (and simplified) general overview of cryptography as a way to improve the security of data exchange.
The demonstration of how digital signatures can be used to allow cross-database access without resorting to cross-database ownership chaining is a simplified version of the demonstration provided by Raul Garcia in his blog, and is also available for download. I've included an additional script for you to play with.
If you've decided on improving SQL Server security using encryption then perhaps this is the first thing to consider: encrypting database files. Microsoft Windows (versions 2000 and later) supports file encryption on NTFS drives – a reliable way of preventing access to database files on illegally acquired devices (e.g. stolen notebooks and/or disks).
If the database has not been created yet then follow these steps:
Encrypting files and folders in Windows NTFS.
If the database already exists then do this:
After completing the steps above the files are encrypted, and any user with sufficient permissions on the SQL Server instance (e.g. ALTER DATABASE) can create encrypted database files provided the correct folder is specified as the file path.
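For the existing-database scenario, a rough sketch of the sequence (the database name and file paths are hypothetical; the encryption step itself happens in Windows, performed while logged on as the SQL Server service account):

```sql
-- 1. detach the database
exec sp_detach_db 'MyDatabase'

-- 2. in Windows, encrypt the folder holding the .mdf/.ldf files
--    (file properties > Advanced > Encrypt contents to secure data),
--    logged on as the SQL Server service account

-- 3. re-attach the database
exec sp_attach_db 'MyDatabase'
 ,N'D:\Data\MyDatabase.mdf'
 ,N'D:\Data\MyDatabase_log.ldf'
```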
Simple, isn't it – yet so rarely used.
As mentioned before, cryptography is a new functionality in SQL Server 2005. Most aspects of data encryption, digital signing of modules, and the use of cryptographic objects for authentication and authorization are covered in Books Online. Additionally, here are the links to a couple of blogs dedicated to cryptography in SQL Server 2005 (and SQL Server security in general) that I found very helpful while researching this new functionality:
I will be presenting most of what I've discovered regarding cryptography in SQL Server 2005 at this year's NT Conference in Portorož (Slovenia) on Wednesday, May 16th 2007. If you're there, you're very welcome to attend – you might even learn something. ;)
If you're interested in SQL Server 2005 security in general, though, you really should check out the SQL Server 2005 Security Workshop by Dejan Sarka on the same day. In fact, if you're administering or designing solutions on SQL Server you really should not miss this workshop. You can find more details at the NT Conference web site.
But now... a topic that I don't intend to discuss in detail at the conference.
Symmetric keys are recommended for data encryption by Books Online as they provide better performance when encrypting and/or decrypting data in SQL Server 2005 compared to asymmetric keys and certificates (at least according to Microsoft). Cryptographic functions provided by the platform (in general) are not entirely problem-free (e.g. data-size limitations discussed in the previous post), but let's not get into all that this time. Symmetric keys in SQL Server 2005 support the following cryptographic algorithms: DES, TRIPLE_DES, RC2, RC4, RC4_128, DESX, AES_128, AES_192, and AES_256. The actual availability of these depends on the operating system (e.g. AES is not available on Windows 2000 or XP).
Encrypting data with symmetric keys is done using the
EncryptByKey system function. This method requires the symmetric key to be opened prior to the function's invocation using the
OPEN SYMMETRIC KEY statement.
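A minimal, self-contained sketch of the whole sequence (all object names and the password are illustrative; a database master key must exist before the certificate can be created):

```sql
-- one-time setup
create master key encryption by password = 'Str0ng!Passphrase'

create certificate DemoCert with subject = 'Demo encryption certificate'

create symmetric key DemoSymKey
 with algorithm = triple_des
 encryption by certificate DemoCert

-- encrypting a value: the key must be open at the time of the call
open symmetric key DemoSymKey decryption by certificate DemoCert

select encryptByKey(key_guid('DemoSymKey'), N'a secret value') as EncryptedValue

close symmetric key DemoSymKey
```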
Similarly, data decryption is done pretty much in the same way, using the
DecryptByKey system function (after opening the key). There are two alternatives to this method, though:
DecryptByKeyAutoCert – can be used for data decryption by symmetric keys protected by certificates; and
DecryptByKeyAutoAsymKey – can be used for data decryption by symmetric keys protected by asymmetric keys.
Both these functions open the symmetric key implicitly.
I was curious whether explicit or implicit opening of the symmetric key influences the performance of these functions in any way. For this purpose I've prepared a rather simplistic set of encrypted values available in this script that you can execute in your own favorite testing database.
The script creates a table and fills it up with short strings (in order not to influence execution significantly). In addition, a certificate is created and then used to protect a newly created symmetric key. This symmetric key uses the DESX cryptographic algorithm.
To see how decryption functions perform on this rather small but IMHO fairly representative set of encrypted values I first traced the execution of two retrieval queries against the data without decrypting the values. This was then followed by two pairs of queries using the
OPEN SYMMETRIC KEY/
DecryptByKey method and then using the
decryptByKeyAutoCert method. BTW: the script produced 148877 rows on 1128 pages with 44 Bytes of encrypted data per row. The only index is the clustered primary key on
StringId – the question of how encrypted data could or should be indexed will not be answered in this article.
These are the results (the queries can be found at the bottom of this post – just follow the links in the Query column):
In contrast to queries #1, #2 and #3 (unrestricted) the result sets from queries #4, #5 and #6 are restricted to the »first« 100 rows (i.e.
StringId < 101). Queries #1 and #4 retrieve unencrypted values, queries #2 and #5 use the
OPEN SYMMETRIC KEY/
DecryptByKey method, and queries #3 and #6 use the DecryptByKeyAutoCert method.
Apparently opening keys automatically (i.e. implicitly) outperforms opening keys manually (i.e. explicitly). Who would have thought...? Although the advantage is only slight (5–10%) it may prove significant on larger sets of data. Can you reproduce this on your system?
Of course decrypting values puts additional stress on the CPU (compared to the retrieval of encrypted data), but with only 2 additional reads per query – these are needed to retrieve (and decrypt) the symmetric key. Well, there you have it. In one of the next posts we'll try to discover why asymmetric keys and certificates are said to be less desirable for data encryption (performance-wise) than symmetric keys.
select
 dbo.Strings.StringId
 ,dbo.Strings.EncryptedString
from dbo.Strings

open symmetric key CryptographyPerformanceSymKey
 decryption by certificate CryptographyPerformanceCert

select
 dbo.Strings.StringId
 ,cast(decryptByKey(dbo.Strings.EncryptedString) as nvarchar(max)) as String
from dbo.Strings

close symmetric key CryptographyPerformanceSymKey

select
 dbo.Strings.StringId
 ,cast(decryptByKeyAutoCert(
   cert_id('CryptographyPerformanceCert')
   ,null
   ,dbo.Strings.EncryptedString
   ) as nvarchar(max)) as String
from dbo.Strings

select
 dbo.Strings.StringId
 ,dbo.Strings.EncryptedString
from dbo.Strings
where (dbo.Strings.StringId < 101)

open symmetric key CryptographyPerformanceSymKey
 decryption by certificate CryptographyPerformanceCert

select
 dbo.Strings.StringId
 ,cast(decryptByKey(dbo.Strings.EncryptedString) as nvarchar(max)) as String
from dbo.Strings
where (dbo.Strings.StringId < 101)

close symmetric key CryptographyPerformanceSymKey

select
 dbo.Strings.StringId
 ,cast(decryptByKeyAutoCert(
   cert_id('CryptographyPerformanceCert')
   ,null
   ,dbo.Strings.EncryptedString
   ) as nvarchar(max)) as String
from dbo.Strings
where (dbo.Strings.StringId < 101)
Although SP2 for SQL Server was released more than a week ago, I haven't yet had the time to upgrade. Which could also be interpreted as good fortune, as SP2a was released yesterday, bringing a fix to the »original« SP2. You can read more about it in this KB article.
In brief: if you use History Cleanup tasks or Maintenance Cleanup tasks in your SQL Server 2005 maintenance plans (or IS packages) and you've already upgraded to SP2 using an installation package downloaded *before* March 5th 2007, you should either:
AFAIK »SP2a« is also not the official name of this release.
The build number of the Critical Update mentioned above is 3050 (bringing the full version number to 9.00.3050), the build number of the corrected SP2 remains unchanged at 3042 (full version: 9.00.3042). Confusing? Well...
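With build numbers this easy to confuse, it helps to check what an instance is actually running; serverproperty is the documented way to do it:

```sql
select
 serverproperty('ProductVersion') as ProductVersion -- e.g. 9.00.3042.00
 ,serverproperty('ProductLevel') as ProductLevel    -- e.g. SP2
```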
Finally, don't forget to replace the old SP2 files in your local repository with the new ones.
After the December 2006 CTP of SQL Server 2005 SP2, the second Service Pack RTM'ed today and is ready for download at the official site.
The official site contains links to individual downloads, here are a few essential short cuts:
The build number of SP2 is 3042 (the full version number is therefore 9.00.3042).
I guess this means that Windows Vista installations can finally be improved with a state-of-the-art RDBMS product. ;)
We've discussed the basics of XML in SQL, we've seen a few examples of XQueries and their performance; I guess we're ready to go a bit deeper. In this article we'll look more closely at XPath expressions and see how they behave in SQL Server 2005 XML Queries.
First of all there are several ways of formulating a valid XPath expression – for a comprehensive overview of the subject please consult the W3C Recommendations on XPath (mentioned in previous posts) and other sources (e.g. the XPath Tutorial at W3Schools Online Web Tutorials).
Generally, we have two options (with examples later in this article):

1. specifying the exact (absolute or relative) path to the target node; or
2. using an approximate path containing one or more wildcards ("//" – anywhere, "element::*" or "*" – any element, "attribute::*" or "@*" – any attribute; of course more on the subject can be found in the W3C Recommendations).

Although wildcards can also be used as part of the absolute or the relative path, such XPath expressions cannot be considered exact.
Specifying the exact path is generally the fastest, but it requires quite an intimate knowledge of the XML in question, and may require existing XPath expressions (e.g. used in T-SQL modules) to be corrected if the XML structure is subsequently modified for whatever reason. Approximate paths, on the other hand, are less structure-dependent, but they suffer from a higher performance cost, especially in complex structures.
Of course neither of the two options may lead to the desired node if more nodes exist in a given XML that correspond to a particular XPath expression. It is therefore necessary to extend the expression in such cases with an appropriate XPath predicate.
If we were to compare XPath/XQuery expressions with T-SQL queries we could say that the XPath expression corresponds to the T-SQL
FROM clause, and the XPath predicate functions as the T-SQL
WHERE clause. An appropriate XPath predicate is practically indispensable in situations where the XPath expression should point to a specific node – e.g. if a singleton is needed (as is the case with the
value() XML method; we've seen such a case before).
For instance, using the following XML:
declare @example xml

set @example = N'<document>
 <title>Hamsters prefer pumpkin seeds</title>
 <paragraph>George B. Quick is a golden hamster.</paragraph>
 <paragraph>He''s quick when curious about his surroundings and even quicker when hungry.</paragraph>
 <paragraph>Although he''s presented with a wide variety of tasty and nutritious goodies each day, George prefers shiny pumpkin seeds.</paragraph>
</document>'
...all the following queries return the value of the
title element using three different XPath expressions:
-- Exact (absolute) path
select @example.query('/document/title').value('.', 'varchar(max)') as Title

-- Approximate path using the "anywhere" wildcard
select @example.query('//title').value('.', 'varchar(max)') as Title

-- Approximate path using the "any element" wildcard
select @example.query('/*/title').value('.', 'varchar(max)') as Title
No predicates are necessary in this particular example as there is only a single
title element in our XML. However, if we wanted to extract the value of a particular
paragraph element we might end up with unexpected results:
select @example.query('/document/paragraph').value('.', 'varchar(max)') as Paragraph
select @example.query('//paragraph').value('.', 'varchar(max)') as Paragraph
select @example.query('/*/paragraph').value('.', 'varchar(max)') as Paragraph
All three queries return a concatenation of values from all
paragraph elements. While this may be what some queries need, it certainly isn't valid for all of them. What can we do? Well, with an appropriate predicate we can restrict the XML Query result. For instance, we can qualify the XPath expression using the
paragraph element's position:
-- Extract the value of the second paragraph...
-- ...using the full syntax for the predicate:
select @example.query('//paragraph[position() = 2]').value('.', 'varchar(max)') as Paragraph

-- ...using shorthand:
select @example.query('//paragraph[2]').value('.', 'varchar(max)') as Paragraph
Although this method is frequently used in printed and online references on the subject of querying XML data (including Books Online!), I strongly advise against it, as it opens the door to a whole world of potential problems. According to the XML specification, the original order of elements need not be preserved (when several elements of the same name are allowed in a given XML document), meaning that it is perfectly legal to change element positions, to insert additional elements anywhere inside the existing element island(s), and so on. Even an XML Schema won't prevent this.
When presented with an unfortunate XML structure like the one in our example, there is little we can do in our XML queries to prevent unexpected or incorrect results caused by changes in the XML. If, on the other hand, the XML structure includes a means of uniquely identifying individual nodes, we can use more resilient XPath expressions:
declare @example xml
set @example = N'<document>
  <title>Hamsters prefer pumpkin seeds</title>
  <paragraph id="1">George B. Quick is a golden hamster.</paragraph>
  <paragraph id="2">He''s quick when curious about his surroundings and even quicker when hungry.</paragraph>
  <paragraph id="3">Although he''s presented with a wide variety of tasty and nutritious goodies each day, George prefers shiny pumpkin seeds.</paragraph>
</document>'

-- Exact (absolute) path using the appropriate predicate
select @example.query('/document/paragraph[@id = 2]').value('.', 'varchar(max)') as Paragraph

-- Approximate path using the "anywhere" wildcard and the appropriate predicate
select @example.query('//paragraph[@id = 2]').value('.', 'varchar(max)') as Paragraph

-- Approximate path using the "any element" wildcard and the appropriate predicate
select @example.query('//*[@id = 2]').value('.', 'varchar(max)') as Paragraph
In the example above we've modified the XML by adding attributes that uniquely identify each repeated element; this way the XPath expressions no longer depend on element positions, but rather on more reliable properties. Of course, this is not the only possible method, nor is it necessarily the most appropriate one in every situation. Quite clearly, data integrity starts in the XML itself, and typing the XML (e.g. with an XML Schema) is only one of the many steps we have to take to maintain it.
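The fragility of positional predicates, and the resilience of attribute-based ones, can also be demonstrated outside SQL Server. Here is a minimal sketch in Python's ElementTree (illustrative only; the XPath subset it supports behaves the same way here):

```python
# Positional vs. attribute predicates when element order changes
# (Python illustration; the article's own examples are T-SQL).
import xml.etree.ElementTree as ET

original = ET.fromstring(
    '<document>'
    '<paragraph id="1">first</paragraph>'
    '<paragraph id="2">second</paragraph>'
    '</document>')

# Both predicates find the same node in the original document...
by_position = original.find('paragraph[2]').text         # 'second'
by_attribute = original.find("paragraph[@id='2']").text  # 'second'

# ...but inserting a new paragraph at the front is perfectly legal XML:
reordered = ET.fromstring(
    '<document>'
    '<paragraph id="0">brand new</paragraph>'
    '<paragraph id="1">first</paragraph>'
    '<paragraph id="2">second</paragraph>'
    '</document>')

# The positional predicate now silently returns a different node,
# while the attribute predicate still finds the intended one.
print(reordered.find('paragraph[2]').text)          # first
print(reordered.find("paragraph[@id='2']").text)    # second
```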
Undoubtedly, data integrity is important, but we should not neglect the performance of data retrieval.
In our performance test [2] we will focus on the use of XML retrieval methods in query conditions, i.e. a comparison of various XPath expressions in the
WHERE clause and their effect on performance. The queries used in this test are listed in the footnotes at the bottom of this article. Please refer to Part Two of the series for the DDL and sample data. Note that both tables are XML-indexed and that appropriate XML schema collections are used to type the XML columns.
(All XML indexes)
| Query | EAV Schema | Query | Typed Schema |
As we've already seen in previous articles, the performance of XML queries compared to that of T-SQL queries is quite poor. However, it should by now be apparent that we can improve the performance of XML retrieval simply by using more appropriate XPath expressions.
For the EAV XML type, approximate XPath expressions seem to be less resource-intensive, while for the strongly typed XML, an exact path yields better performance. Once again, it seems that SQL Server prefers typed XML.
Anyway, we're not quite done yet. Believe it or not, there still remains some room for improvement. And it won't require any changes to the XML or the XML Schema – the XPath expressions can be further improved. How about halving CPU time and decimating reads?
(All XML indexes)
| Query | EAV Schema | Query | Typed Schema |
By using even more exact XPath expressions we can cut CPU time by a factor of 2 to 3 and the number of reads by a factor of 3 to 8! This *is* without a doubt a significant improvement.
But where's the catch?
Well, where does the value of a simple (or mixed) XML element actually reside? In its
text node, that's where. It is therefore only logical to look for it right there and nowhere else, and apparently this is exactly how XML data retrieval works in SQL Server: by specifying an XPath expression that is as exact as possible, we allow the database engine to navigate past other possible locations (e.g. child nodes) and go directly to the intended target (the bull's-eye).
Even though our target element holds no descendants (and, according to the XML schema collections being used, isn't allowed to hold any), the "/attributeCollection/firstName" XPath expression actually points to the entire element (i.e. a node-set, not just a single node!), while the "/attributeCollection/firstName/text()" XPath expression points to a single, final node. At least this is what I've observed; after all, I haven't designed the darn thing. ;) It makes perfect sense, though.
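The distinction between an element node and its text child can be made tangible outside SQL Server as well. Here is a minimal sketch using Python's DOM implementation (illustrative only; the node types mirror what the two XPath expressions above address):

```python
# An element node vs. its text-node child, shown with xml.dom.minidom
# (Python illustration; not part of the original T-SQL article).
from xml.dom import minidom

doc = minidom.parseString(
    '<attributeCollection><firstName>Karen</firstName></attributeCollection>')

element = doc.getElementsByTagName('firstName')[0]
text_node = element.firstChild

# '/attributeCollection/firstName' addresses the element node...
print(element.nodeType == element.ELEMENT_NODE)    # True
# ...while '/attributeCollection/firstName/text()' addresses its text child
print(text_node.nodeType == text_node.TEXT_NODE)   # True
print(text_node.nodeValue)                         # Karen
```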
What have we learned?
Next time we'll take a look at something rather unexpected and quite useful (in some respects).
declare @firstName varchar(8)
set @firstName = 'Karen'
select ANV.AttributeId as AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query('/attributeCollection/attribute[name = "firstName"]/value').value('.', 'varchar(max)') = @firstName)

select ANV.AttributeId as AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query('//attribute[name = "firstName"]/value').value('.', 'varchar(max)') = @firstName)

select ANV.AttributeId as AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query('//value[../name = "firstName"]').value('.', 'varchar(max)') = @firstName)

select ANV.AttributeId as AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query('/attributeCollection/attribute[name/text() = "firstName"]/value/text()').value('.', 'varchar(max)') = @firstName)

select ANV.AttributeId as AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query('//attribute[name/text() = "firstName"]/value/text()').value('.', 'varchar(max)') = @firstName)

select ANV.AttributeId as AttributeId
from dbo.AttributesNameValue ANV
where (ANV.Attribute.query('//value[../name/text() = "firstName"]/text()').value('.', 'varchar(max)') = @firstName)
select AT.AttributeId as AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.query('/attributeCollection/firstName').value('.', 'varchar(max)') = @firstName)

select AT.AttributeId as AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.query('//firstName').value('.', 'varchar(max)') = @firstName)

select AT.AttributeId as AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.query('//*[local-name() = "firstName"]').value('.', 'varchar(max)') = @firstName)

select AT.AttributeId as AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.query('/attributeCollection/firstName/text()').value('.', 'varchar(max)') = @firstName)

select AT.AttributeId as AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.query('//firstName/text()').value('.', 'varchar(max)') = @firstName)

select AT.AttributeId as AttributeId
from dbo.AttributesTyped AT
where (AT.Attribute.query('//*[local-name() = "firstName"]/text()').value('.', 'varchar(max)') = @firstName)