Monday, May 09, 2011

SQL Server: The XPath ID Function

XML provides a simple and efficient way of storing relational data, especially for the purposes of transporting it from one RDBMS to another. Besides the "natural" technique, where the structure of the XML document is used to represent the relationships, XML Schema also provides three special data types that can be used to define the relationships inside an XML document. I've already discussed the former in my chapter of SQL Server MVP Deep Dives, so in this article we will examine the latter.


The xs:ID, xs:IDREF and xs:IDREFS XML Schema Data Types

The ID data type is a derived XML Schema data type:

The top-most type above is the string data type, a primitive XML Schema data type, and the types following it are derived from string by restriction – each derivate is more restrictive in respect to its properties and use than its base type.

While a valid string node can contain any character, a valid ID must begin with a letter or an underscore ("_") character, may contain any numeral and/or letter, but may only contain a restricted set of punctuation characters (the dot ".", the dash "-", and the underscore "_"), and it must not contain any blank characters (spaces, tabs, carriage-return or line-feed characters).1

The purpose of the ID data type is to serve as a key – to uniquely identify each individual element of an XML document. Therefore, each ID value inside an individual XML document must be unique. This also serves a second purpose, namely to allow each uniquely identified element to be referenced by one or more related elements of the same document – via an IDREF or an IDREFS attribute.

...
ID
IDREF ... IDREFS

The IDREF data type is derived from the ID type – also by restriction: IDREF nodes can only contain valid ID values that exist inside the same XML document. For instance, in the following XML document, the address/@address_id attribute (typed as ID) represents a unique key of each address element, and the person/@address_id attribute (typed as IDREF) represents a reference to an address element. However, in this particular example only the first person element contains a valid reference, while the second one does not:

An address element with @address_id of "A3" does not exist in the above document.

The IDREFS data type is derived from the IDREF data type by list – it may contain a single ID reference or a space-delimited list of ID references. For instance, in the following example, the person/@address_id attribute (this time typed as IDREFS) contains multiple references:

NB: If the examples in this article aren't displayed correctly, and you're using Internet Explorer 9, please, use Compatibility View.

To illustrate the use of XML to store relational data by implementing these special XML Schema data types, I've created two XML documents from the data available in the pubs database, which is (or rather, used to be) a simplistic, yet robust data sample used by SQL Server documentation. The sample database is (still) available online.

For those of you who never worked with the pubs sample: it represents a database used in a library application used in the management of information about books, authors, publishers, etc. In this article I use two particular relationships from the pubs data model:

  1. Each book is published by exactly one publisher, and each publisher can publish zero, one or more books;
  2. Each book is written by one or more authors, and each author has written one or more books.2

Publishers and Titles

The relationship between the publishers and the books is shown in the following XML document:

The document contains a set of publishers (publisherCollection), each with a unique identifier stored in the @publisherId attribute (publisher/@publisherId), and a set of books (titleCollection), each with a valid reference to a publisher (title/@publisherId).

The following XML Schema governs the integrity of the "Publishers and Titles" XML document:

I prepared a script that you can use to create the corresponding XML Schema Collection in your copy of the pubs database. You will need to do so in order to try out the examples in this article. (NB: you can safely remove both XML Schema Collections presented in this article after you're done experimenting.)

I used the following T-SQL query to compose the first XML document, demonstrating the relationship between the Publishers and the Titles:

begin
 with xmlnamespaces
  (
  'http://www.w3.org/2001/XMLSchema' as xs
  ,'http://schemas.milambda.net/pubs' as bp
  )
 select cast((
  select (
   select [@bp:publisherId]
     = N'Pub_'
     + publishers.pub_id
    ,publishers.pub_name as [bp:name]
    ,publishers.city as [bp:city]
    ,publishers.[state] as [bp:state]
    ,publishers.country as [bp:country]
    from dbo.publishers
    where (exists (
      select *
       from dbo.titles
       where (titles.title_id like 'BU%')
        and (titles.pub_id = publishers.pub_id)
      ))
    for xml path('bp:publisher'), root('bp:publisherCollection'),
        elements xsinil, type
   )
   ,(
   select [@bp:titleId]
     = N'Titl_'
     + titles.title_id
    ,[@bp:publisherId]
     = N'Pub_'
     + titles.pub_id
    ,titles.title as [bp:title]
    ,titles.[type] as [bp:type]
    ,titles.price as [bp:price]
    from dbo.titles
    where (titles.title_id like 'BU%')
    for xml path('bp:title'), root('bp:titleCollection'),
        elements xsinil, type
   )
  for xml path(''), root('bp:pubs'), type) as xml(dbo.BookPublisher))
end

The query above produces a typed XML document. Proper typing is needed, not only to preserve data integrity, but also to provide the XML processor with information about the relationships that exist inside the XML document.


Titles and Authors

The relationship between the authors and the books is shown in the following XML document:

This document contains a set of authors (authorCollection), each with a unique identifier stored in the @authorId attribute (author/@authorId), and a set of books (titleCollection), each with a valid set of references to one or more authors (title/@authorIds).

The "Titles and Authors" XML document is also governed by an XML Schema:

You can use this script to create the corresponding XML Schema Collection in the pubs database.

This XML document was also composed using T-SQL, and is also typed accordingly:

begin
 with xmlnamespaces
  (
  'http://www.w3.org/2001/XMLSchema' as xs
  ,'http://schemas.milambda.net/pubs' as ba
  )
 select cast((
  select (
   select [@ba:authorId]
     = N'Auth_'
     + authors.au_id
    ,authors.au_lname as [ba:lastName]
    ,authors.au_fname as [ba:firstName]
    from dbo.authors
    where (exists (
      select *
       from dbo.titleauthor
        inner join dbo.titles
          on titles.title_id = titleauthor.title_id
       where (titles.title_id like 'BU%')
        and (titleauthor.au_id = authors.au_id)
      ))
    for xml path('ba:author'), root('ba:authorCollection'), type
   )
   ,(
   select [@ba:titleId]
     = N'Titl_'
     + titles.title_id
    ,[@ba:authorIds]
     = (
     select [authorId]
       = N'Auth_'
       + titleauthor.au_id
      from dbo.titleauthor
      where (titleauthor.title_id = titles.title_id)
      for xml path('titleAuthor'), type
     ).query('distinct-values(titleAuthor/authorId)').value('.', 'nvarchar(max)')
    ,titles.title as [ba:title]
    ,titles.[type] as [ba:type]
    ,titles.price as [ba:price]
    from dbo.titles
    where (titles.title_id like 'BU%')
    for xml path('ba:title'), root('ba:titleCollection'),
        elements xsinil, type
   )
  for xml path(''), root('ba:pubs'), type) as xml(dbo.BookAuthor))
end

Since more than one author could have collaborated on a single book, there could be multiple references from a Title to the Authors. SQL Server implements the fn:distinct-values XPath function that I've used in the query above to list multiple references, representing the many-to-many relationship between a Title and its Authors.

The function accepts a single argument, namely an XPath expression pointing to one or more nodes, and returns the atomic values from the resulting node set in form of a space-separated string, without duplicates, which is exactly what we need in our case – a set of references, to be placed in an IDREFS attribute.


The fn:id XPath Function

To query the XML documents composed in this article – for instance, to retrieve the Publisher who published a particular Book, or to retrieve a list of Authors who wrote it – we can use the fn:id XPath function, introduced in SQL Server 2005.

The current SQL Server implementation of this function accepts a single argument, namely an XPath expression pointing to an xs:IDREF (or xs:IDREFS) node containing a reference (or a list of references). The function returns a set of related XML nodes, containing the corresponding ID values.

  1. To retrieve a list of publishers whose books cost less than $11 from the "Publishers and Titles" XML document:
    begin
     with xmlnamespaces
      (
      'http://www.w3.org/2001/XMLSchema' as xs
      ,'http://schemas.milambda.net/pubs' as bp
      )
     select Pubs.Publisher.query('bp:name').value
                ('.', 'varchar(40)') as PublisherName
      from @pubs.nodes
        ('
        id(data(/bp:pubs/bp:titleCollection/bp:title
            [bp:price < 11]/@bp:publisherId))
        ') Pubs (Publisher)
    end
    Result:
    PublisherName
    ----------------------------------------
    New Moon Books
    
    (1 row(s) affected)
  2. To retrieve a list of authors whose books cost less than $12 from the "Titles and Authors" XML Document:
    begin
     with xmlnamespaces
      (
      'http://www.w3.org/2001/XMLSchema' as xs
      ,'http://schemas.milambda.net/pubs' as ba
      )
     select Pubs.Author.query('ba:lastName').value
                ('.', 'varchar(40)') as LastName
      ,Pubs.Author.query('ba:firstName').value
                ('.', 'varchar(20)') as FirstName
      from @pubs.nodes
        ('
        id(data(/ba:pubs/ba:titleCollection/ba:title
            [ba:price < 12]/@ba:authorIds))
        ') Pubs (Author)
    end
    Result:
    LastName                                 FirstName
    ---------------------------------------- --------------------
    Green                                    Marjorie
    O'Leary                                  Michael
    MacFeather                               Stearns
    
    (3 row(s) affected)

Can you think of another alternative for each of these queries? Is it more or less complex than the one with the fn:id function?


The fn:idref XPath function

The fn:idref XPath function should do exactly the opposite – for instance, return a list of Books published by a particular Publisher, or a list of Books written by a particular Author. I say should, because unfortunately the fn:idref function is not implemented in SQL Server.

To query a relationship from the perspective of the key, we have to resort to slightly more elaborate techniques. For example, using XQuery:

  1. To retrieve a list of books published by a publisher from Berkeley:
    begin
     with xmlnamespaces
      (
      'http://www.w3.org/2001/XMLSchema' as xs
      ,'http://schemas.milambda.net/pubs' as bp
      )
     select Titles.Title.query('bp:title').value('.', 'varchar(80)') as Title
      from @pubs.nodes
        ('
        for $p in /bp:pubs/bp:publisherCollection/bp:publisher
            [bp:city = "Berkeley"]/@bp:publisherId
        return /bp:pubs/bp:titleCollection/bp:title[@bp:publisherId = $p]
        ') Titles (Title)
    end
    Result:
    Title
    --------------------------------------------------------------------------------
    The Busy Executive's Database Guide
    Cooking with Computers: Surreptitious Balance Sheets
    Straight Talk About Computers
    
    (3 row(s) affected)
  2. To retrieve a list of books published by an author with the last name of Green:
    begin
     with xmlnamespaces
      (
      'http://www.w3.org/2001/XMLSchema' as xs
      ,'http://schemas.milambda.net/pubs' as ba
      )
     select Titles.Title.query('ba:title').value('.', 'varchar(80)') as Title
      from @pubs.nodes
        ('
        for $a in /ba:pubs/ba:authorCollection/ba:author
            [ba:lastName = "Green"]/@ba:authorId
        return /ba:pubs/ba:titleCollection/ba:title
            [contains(string(@ba:authorIds), $a)]
        ') Titles (Title)
    end
    Result:
    Title
    --------------------------------------------------------------------------------
    The Busy Executive's Database Guide
    You Can Combat Computer Stress!
    
    (2 row(s) affected)

Relational Data and XML

There you have it – yet another method of using XML as a means of transporting relational data. This time enforced in composition by dedicated XML Schema data types, and assisted in retrieval by the corresponding XPath function(s). Why wait until the data has reached its destination to verify whether relationships have been preserved accordingly?

Update: If the pubs database is not available in your environment, and if you're merely interested in the retrieval part of this exercise, you can find a simplified variation of the samples in another post.


ML


1 You can find all the details in the XML Schema Datatypes W3C Recommendation.

2 How would we call an author who hasn't written any books? ;)

1 comment:

Brad Schulz said...

This is great stuff! Thank you for a really informative post.

--Brad