snaps & snippets: 2006

Sunday, December 24, 2006

A post not dedicated to SQL

Here's to new beginnings!

May peace and pleasure fill your holidays, and continue into a fantastic 2007!

Wednesday, December 20, 2006

Windows Vista and SQL Server 2005

There have been several discussions in the communities (forums, newsgroups, blogs) on the fact that Vista, the next version of the Microsoft Windows operating system, does not (or will not) support SQL Server 2005. And when I say "discussions", I really should say "complaints".

Yes, SQL Server 2005 and SQL Server 2005 Express are not supported on Vista. That's a fact. Here's another fact: Vista has not been officially released yet.

Although the new operating system has been released to manufacture on November 8^th 2006 it will not be widely available until the end of January 2007. So, how can anyone expect Vista to support SQL Server 2005 if it hasn't been released yet? ;)

Back in September 2006 (that's *before* the new operating system was released to manufacture) Microsoft already announced that Vista (and "Longhorn") will only support SQL Server 2005 Service Pack 2 or later. Obviously, not everyone actually reads (or understands) these announcements. :)

Anyway, yesterday the December 2006 Community Technology Preview (CTP) of SQL Server 2005 SP2 has been made available, which – I assume – will put an end to those completely unfounded rumours.

Me? I'm waiting for SP2 RTM. After all, this is the holiday season. No need to spoil it with installs and re-installs. ;)

Update (February 19^th 2007)

Windows Vista has been officially released on January 29^th 2007 followed by the release of SQL Server 2005 Service Pack 2 on February 19^th which means that SQL Server 2005 can now be used on the new version of this popular Operating System. This should put an end to the confusion.

Monday, November 27, 2006

XQuery 1.0 Becomes a Proposed Recommendation

On November 22^nd the World Wide Web Consortium announced that the XQuery 1.0, XPath 2.0 and XSLT 2.0 became Proposed Recommendations. In other words, it's safe to assume that in the early 2007 we should expect the work on current versions of these three XML management standards to end with proper W3C Recommendations. Obviously, the R&D activities of the XML Query Working Group will not seize, only the current versions of the specifications will have been made final.

From the W3C Glossary:

Proposed Recommendation (PR)
A Proposed Recommendation is a mature technical report that, after wide review for technical soundness and implementability, W3C has sent to the W3C Advisory Committee for final endorsement.
W3C Recommendation (REC)
A W3C Recommendation is a specification or set of guidelines that, after extensive consensus-building, has received the endorsement of W3C Members and the Director. W3C recommends the wide deployment of its Recommendations. Note: W3C Recommendations are similar to the standards published by other organizations.

The specifications are available at the W3C Website:

XQuery 1.0: An XML Query Language – W3C Proposed Recommendation 21 November 2006;
XML Path Language (XPath) 2.0 – W3C Proposed Recommendation 21 November 2006;
XSL Transformations (XSLT) Version 2.0 – W3C Proposed Recommendation 21 November 2006.

The accompainying specifications are also avaliable:

XQuery 1.0 and XPath 2.0 Data Model (XDM) – W3C Proposed Recommendation 21 November 2006;
XSLT 2.0 and XQuery 1.0 Serialization – W3C Proposed Recommendation 21 November 2006;
XML Syntax for XQuery 1.0 (XQueryX) – W3C Proposed Recommendation 21 November 2006;
XQuery 1.0 and XPath 2.0 Formal Semantics – W3C Proposed Recommendation 21 November 2006.

The XQuery 1.0 and XPath 2.0 Functions and Operators (W3C Proposed Recommendation 21 November 2006) specification has also been made available, and – as far as the XPath/XQuery functions are concerned – remains pretty much unchanged from the Candidate Recommendation. This means that so far, the list of XPath/XQuery functions does not need any updating.

The question that remains is whether the RTM version of SQL Server 2005 Service Pack 2 will reflect any of this. I expect no changes in the SQL Server database engine regarding the XML Query Language implementation nor any additional support of the XPath/XQuery functions in the current version of the product, although the CTP of the SP2 already contains certain breaking changes that have received some negative reactions from the community. I guess we'll see more XPath/XQuery support in the next full version of SQL Server.

Friday, November 24, 2006

SQL Server 2005 XML Methods, Part One, Basic Retrieval

As mentioned in one of the previous posts XML became a native data type with SQL Server 2005. Opinions on its usefulness in a relational database are divided – "fragmented", even. ;) The purpose of this series of posts is to shed some light on how the XML data type could be utilized efficiently, and what are the caveats that should be avoided. In this first part specifically, we'll take a look at the built-in data retrieval methods concerning the XML data type.

So, you've decided on using the XML data type in your SQL Server database. Now your main concern is how to access parts of that XML in your queries. There are three data retrieval XML methods available in SQL Server 2005:

query() – returns the results of an XML Query (the entire result of the XML Query in one row). The result is untyped XML;
nodes() – returns a rowset of nodes returned by an XML Query (one row for each node matched by the XML Query expression). The result is a rowset of untyped XML values;
value() – returns the result of an XML Query as a value in the specified data type (the XML Query statement and the name of the data type for the result are passed as parameters). The result of the XML Query used in this function must be a singleton, otherwise an error is raised. The final result is value in the data type specified at the function's invocation.

Which one to use and when?

If values are needed rather than XML nodes then the value function should be used to access the data (obvious, isn't it?). When designing queries it is vital to take into account that the XQuery results evaluate to singletons before the value function can be applied.

Yes, an example might be of help here:

Let's start with the following XML:

declare @xml xml

set @xml = N'<?xml version="1.0" encoding="utf-8"?>
<attributeCollection>
 <attribute>
  <name>firstName</name>
  <value>Catherine</value>
 </attribute>
 <attribute>
  <name>middleName</name>
  <value>R.</value>
 </attribute>
 <attribute>
  <name>lastName</name>
  <value>Abel</value>
 </attribute>
</attributeCollection>'

Executing either of these two queries:

Query #1

select @xml.query(
               '//attribute'
               ).value(
                       'value'
                       ,'varchar(max)'
                       ) as AttributeValue

...or:

Query #2

select Attributes.Attribute.value(
                'value'
                ,'varchar(max)'
                ) as AttributeValue
 from @xml.nodes(
        '//attribute'
        ) Attributes (Attribute)

...leads to the following exception:

Msg 2389, Level 16, State 1, Line #
XQuery [value()]: 'value()' requires a singleton (or empty sequence),
found operand of type 'xdt:untypedAtomic *'

The road to a valid XML singleton lies within the XPath expression – it must evaluate to a single XML node. If we correct the XPath expression in both queries, the exception won't occur:

Query #3

select @xml.query(
                '//attribute/value'
                ).value(
                        '.'
                        ,'varchar(max)'
                        ) as AttributeValue

...and:

Query #4

select Attributes.Attribute.value(
                '.'
                ,'varchar(max)'
                ) as AttributeValue
 from @xml.nodes(
            '//attribute/value'
            ) Attributes (Attribute)

This brings us to the first consideration regarding the choice of XML data retrieval methods. The principal question we need to answer is purely conceptual and deals with the business purpose of a particular XML: what relationships exist inside the particular XML? And don't say none. ;) The goal of queries against the XML should be clear – clarify it by deciding whether the relationships inside the XML are of value to the relational model.

For example: Query #3 returns a concatenation of values from all nodes corresponding to the XPath expression, while Query #4 returns a set of values:

Results for Query #3:

AttributeValue
-------------------
CatherineR.Abel

(1 row(s) affected)

Results for Query #4:

AttributeValue
-------------------
Catherine
R.
Abel

(3 row(s) affected)

Usually, if the relationships in a particular XML model significantly influence the data use then it would be more appropriate (especially performance-wise) if they were properly catered for in the database model directly (stored separately, relationally constrained, and indexed if applicable) rather than left to dwell within the depths of the XML.

However, there are cases when it is more appropriate (from a business perspective) to leave data as XML entities even in the data store; e.g. document management systems (where each document is stored as an XML entity and a single data model is used to store several classes or types of documents), information platforms (where on the business layer the solution needs to cater to seemingly highly diverse requirements, often leaving the end-users to have uninhibitted control over the structure of their entities, while the solution actually allows for unified physical storage), etc.

In fact, once again there is no straight answer – the actual implementation depends on the requirements of the case in hand. This influences the design of the data model, and the latter influences the choice of appropriate retrieval methods.

Is that it?

Well, this has been an incredibly brief introduction to the subject of XML retrieval methods provided by SQL Server 2005, but we're not done yet! Next time we'll be looking at a few examples where the three built-in XML methods we've seen today are used in various combinations. And we might just as well make a few surprising discoveries. ;)

Friday, November 10, 2006

NOLOCK no good

Query hints should be used with caution, and a special care should be given to locking hints; in Books Online this subject is fully covered, and several SQL professionals out there have added their own views and experience, which is also very evident online. I've also used some examples of how query hints work when discussing optimistic concurrency (and some other places) – purely for illustration purposes.

Too often are locking hints missused, or at least used inappropriately – either when developers ignore the possibility of negative consequences altogether or when they focus purely on performance, rather than balancing the latter with data consistency and accuracy. Visit any SQL Server related newsgroup and you'll find really good examples of really bad query hint abuse. Luckily, these users (still) represent only a less significant minority. But on the other hand, when sensitive data is concerned a single member of this minority with their *special performance imporvements* is enough to cause mayhem. Joe Celko might also say: murder.

Tony Rogerson posted a very interesting article on his blog today dealing with the possibility of truly unpleasant consequences when the NOLOCK hint is used. Take a look at Tony's example, reproduce it on your system, and see for yourself how devastating such possibilities really can be. Great article!

Don't get me wrong, I'm not saying (and I don't think Tony is) that the NOLOCK hint should not be used at all – far from it: sometimes it really is the only way to go, but never should it be used without first considering the alternatives (if any) and possible consequences. In fact, if you ask me, the same goes for any query hint. Test, re-test, double-test, and – just to make sure – test again.

Further more, Tony has also announced more articles on the subject of concurrency and consistency, so keep an eye on his blog. I know I will.

Thursday, October 26, 2006

The expense of down-converting unicode values

Using functions on columns in query conditions is bad practice and should be avoided – I've mentioned that fact before. As it turns out, sometimes the database engine itself chooses to transform column data in order to test a condition, causing a negative performance impact. A serious one as well.

As Marco Russo describes in his post on a problem one of his users has encountered, the database engine apparently sometimes resorts to implicit data conversions when evaluating query conditions, resulting in poor performance as the optimizer fails to utilize Index Seeks and uses Index Scans instead. See this Microsoft Connect Feedback article for the repro and the official explanation from the SQL Server development team.

I've also posted a comment on this subject (twice, as it turns out ;) after I've discovered that the problem only ever occurs with SQL Server collations. Can anyone out there confirm this? Please, respond to the Microsoft Connect website if you can.

Is this yet another reason why Windows collations should be prefered over SQL collations?

After successfully validating the issue I have tried to see what could be done to overcome this apparent anomally and have narrowed the options down to two:

To explicitly cast the unicode value to the appropriate datatype and collation in the query – while this does indeed take care of the problem and gives us the ability to influence the query execution, it seems cumbersome and unintuitive; or
To use Windows collations instead of SQL collations – this would seem to be the simplest solution, but it would require modifications of existing database schemas, and I guess we all know *that* particular snowball game.

However, at the moment I would still prefer a confirmation of whether this issue truly only occurs with SQL Server collations.

Tuesday, October 03, 2006

EXISTS or NOT EXISTS... that is the question here

Whether 'tis nobler to count the rows in tables
And stall the traffic with outrageous pride
Or to merely stop as early as a single row
Fullfills the expectations; To exist: to be found.

Has Hamlet ever designed a data driven solution?

How to check whether one or more rows corresponding to a given set of conditions exist? It should be common knowledge by now that this

if ((
 select count(*)
  from ...
  where ...
 ) > 0)
 begin
  -- logic if rows exist
 end
else
 begin
  -- logic if rows don't exist
 end

looks like the answer, but is not, or at least *might* not be. Nonetheless, methods like this one still manage to creep into production code now and then. Is that all bad? Keep reading – you may be surprised.

Anyway, in T-SQL there's a far more efficient method of checking whether rows exist, and we'll take a look at how efficient it actually is. And perhaps learn something new. Start by creating this table in your favourite SQL 2000 (yes, that's 2K) "testing-ground" database (the script assumes that [SQL2005] is the name of your SQL 2005 server where the AdventureWorks database resides). The only index included in the script is a clustered primary key on the identity column.

`EXISTS`

The expression is described in Books Online, along with some typical examples of use. The example above can thus be rewritten as:

if (exists (
  select *
   from ...
   where ...
  ))
 begin
  -- logic if rows exist
 end
else
 begin
  -- logic if rows don't exist
 end

What's the benefit? First of all, the query is now semantically correct: instead of counting the rows to test whether there is at least one that corresponds to the search criteria, we simply query the database engine for their existence. The more important aspect is the performance gain: queries using the EXISTS expression outperform those using the "if-count-greater-than-zero" alternative. Well, they do in SQL 2000.

The example queries are available at the bottom of the post – follow the links...

Query	SQL 2000 Performance
	Clustered Primary Key
	CPU	Reads
#1	15	118
#2	0	7

118 reads vs. 7 (seven!) against a table with 31465 rows in total (30086 of those correspond to the criteria). The EXISTS expression performs approximately 16 times faster than the one based on the row count.

How does SQL Server process an EXISTS expression? To put it as simply as possible: data processing stops as soon as the condition is met, and only as much data is actually read as is needed to test whether the condition is satisfied by *any* row since no rows need to be returned to the client. This means that if an appropriate index exists, the database engine (theoretically) only needs to access a single index page.

Query	SQL 2000 Performance
	Covering Index
	CPU	Reads
#1	15	92
#2	0	2

With a covering index we're down to 2 reads for the EXISTS expression. The performance of the "if COUNT > 0" method is also slightly improved by the covering index as all processing can be done on index pages.

* or 1 or what?

There are three methods of forming the EXISTS query that can be observed in practice:

exists (select * ...)

exists (select 1 ...)

exists (select %table_primary_key% ...)

Each of the above examples comes with a set of at least three users vigorously defending theirs as the most correct of all. The fact remains that the SELECT statement of the EXISTS expression does not affect the execution plan at all. IMHO, option #3 is the least sensible but still there's only one reason why it shouldn't be used: the query would cause errors if the table's primary key column(s) were renamed.

According to an urban myth option #1 is supposed to be the only correct one as "it allows the optimizer to choose the most appropriate index". This is perfecly unfounded as all three examples produce the same execution plan. At least in Microsoft SQL Server. => Busted!

With table hints the optimizer can be "guided" (=forced) into using a specific index. However, the use of index hints should be revised when (perhaps more appropriate) indexes are added or when the schema changes or even when the table usage changes.

Query	SQL 2000 Performance
	Clustered Index Scan			Nonclustered Index Seek ^{(Covering Index)}
	Reads	Estimated Row Size ^(Bytes)	Estimated I/O Cost	Reads	Estimated Row Size ^(Bytes)	Estimated I/O Cost
#3	7	45	0,0032066	2	45	0,0032041
#4	8	45	0,0032066	2	45	0,0032041

Query	SQL 2005 Performance
	Clustered Index Scan			Nonclustered Index Seek ^{(Covering Index)}
	Reads	Estimated Row Size ^(Bytes)	Estimated I/O Cost	Reads	Estimated Row Size ^(Bytes)	Estimated I/O Cost
#3	7	15	0,086088	2	9	0,0668287
#4	8	23	0,086088	2	15	0,0668287

It's safe to assume that an appropriate nonclustered index will be the optimizer's preferred choice. In fact, covering indexes once again take the gold.

But now, here's a bit of a surprise...

SQL Server 2005 knows about "if-count-greater-than-zero"

In SQL 2005 another performance improvement becomes apparent, sadly only when querying with the "COUNT > 0" or the "COUNT >= 1" condition.

Query	SQL 2005 Performance
	Clustered Primary Key		Covering Index
	CPU	Reads	CPU	Reads
#1	0	7	0	2
#2	0	7	0	2
#5	0	7	0	2
#6	20	115	15	89

What?! Has counting rows been improved this much in SQL 2005? Nope. Query #6 tests whether there are more rows than 1 and the behaviour is back to expected. Let's compare execution plans:

The expected behaviour (COUNT in SQL 2000);

(using the clustered primary key):

|--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1002]>0) then 1 else 0))
     |--Nested Loops(Inner Join)
          |--Constant Scan
          |--Compute Scalar(DEFINE:([Expr1002]=Convert([Expr1013])))
               |--Stream Aggregate(DEFINE:([Expr1013]=Count(*)))
                    |--Filter(WHERE:([SalesData].[OrderDate]<=getdate()))
                         |--Clustered Index Scan(OBJECT:(
                            [TestingGround].[dbo].[SalesData].[pk_SalesData_SalesDataId]),
                            WHERE:([SalesData].[OrderDate]>='Jan  1 2002 12:00AM'))

or (using the nonclustered covering index):

|--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1002]>=1) then 1 else 0))
     |--Nested Loops(Inner Join)
          |--Constant Scan
          |--Compute Scalar(DEFINE:([Expr1002]=Convert([Expr1013])))
               |--Stream Aggregate(DEFINE:([Expr1013]=Count(*)))
                    |--Index Seek(OBJECT:(
                       [TestingGround].[dbo].[SalesData].[x_SalesData_OrderDate_TotalDue]),
                       SEEK:([SalesData].[OrderDate] >= 'Jan  1 2002 12:00AM'
                       AND [SalesData].[OrderDate] <= getdate()) ORDERED FORWARD)

The surprising improvement (COUNT > 0, COUNT >= 1 and EXISTS in SQL 2005);

(using the clustered primary key):

|--Compute Scalar(DEFINE:([Expr1005]=CASE WHEN [Expr1006] THEN (1) ELSE (0) END))
     |--Nested Loops(Left Semi Join, DEFINE:([Expr1006] = [PROBE VALUE]))
          |--Constant Scan
          |--Filter(WHERE:([TestingGround].[dbo].[SalesData].[OrderDate]<=getdate()))
               |--Clustered Index Scan(OBJECT:(
                  [TestingGround].[dbo].[SalesData].[pk_SalesData_SalesDataId]),
                  WHERE:([TestingGround].[dbo].[SalesData].[OrderDate]
                  >='2002-01-01 00:00:00.000'))

or (using the nonclustered covering index):

|--Compute Scalar(DEFINE:([Expr1005]=CASE WHEN [Expr1006] THEN (1) ELSE (0) END))
     |--Nested Loops(Left Semi Join, DEFINE:([Expr1006] = [PROBE VALUE]))
          |--Constant Scan
          |--Index Seek(OBJECT:(
             [TestingGround].[dbo].[SalesData].[x_SalesData_OrderDate_TotalDue]),
             SEEK:([TestingGround].[dbo].[SalesData].[OrderDate] >= '2002-01-01 00:00:00.000'
             AND [TestingGround].[dbo].[SalesData].[OrderDate] <= getdate()) ORDERED

Notice the Stream Aggregate operator in the first case and its mysterious absence from the second?

Obviously, SQL Server 2005 "knows" what "if-count-greater-than-zero" and "if-count-at-least-one" mean – the execution plans for "COUNT > 0", "COUNT >= 1" and EXISTS are the same. Further more, setting the compatibility level to 80 or lower does not stop the new behaviour. Is this the result of an intended improvement or merely a side-effect of some other improvement?

The world keeps on changing

There certainly is strength in the EXISTS expression, and starting with SQL 2005 there appears to be strength in methods previously thought of as weak. At least this is true of the two presented here.

P.S. If you're looking for the exist XML method, you can find it right here.

^* Queries used in the demonstration:

Query #1:

if ((
 select count(*)
  from dbo.SalesData
  where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
 ) > 0)
 begin
  print 1
 end
else
 begin
  print 0
 end

Query #2:

if (exists (
  select *
   from dbo.SalesData
   where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
  ))
 begin
  print 1
 end
else
 begin
  print 0
 end

Covering index DDL script:

create nonclustered index x_SalesData_OrderDate_TotalDue
 on dbo.SalesData
  (
  OrderDate
  ,TotalDue
  )

Query #3:

-- Forcing the use of the clustered primary key:
if (exists (
  select *
   from dbo.SalesData with(index(pk_SalesData_SalesDataId))
   where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
  ))
 begin
  print 1
 end
else
 begin
  print 0
 end

-- Forcing the use of the covering nonclustered index:
if (exists (
  select *
   from dbo.SalesData with(index(x_SalesData_OrderDate_TotalDue))
   where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
  ))
 begin
  print 1
 end
else
 begin
  print 0
 end

Query #4:

-- Forcing the use of the clustered primary key:
if (exists (
  select *
   from dbo.SalesData with(index(pk_SalesData_SalesDataId))
   where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
    and (dbo.SalesData.TotalDue = 3953.9884)
  ))
 begin
  print 1
 end
else
 begin
  print 0
 end

-- Forcing the use of the covering nonclustered index:
if (exists (
  select *
   from dbo.SalesData with(index(x_SalesData_OrderDate_TotalDue))
   where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
    and (dbo.SalesData.TotalDue = 3953.9884)
  ))
 begin
  print 1
 end
else
 begin
  print 0
 end

Query #5:

if ((
 select count(*)
  from dbo.SalesData
  where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
 ) >= 1)
 begin
  print 1
 end
else
 begin
  print 0
 end

Query #6:

if ((
 select count(*)
  from dbo.SalesData
  where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
 ) > 1)
 begin
  print 1
 end
else
 begin
  print 0
 end

Thursday, September 28, 2006

SQL 2005 and tempdb

With the expansion of the row versioning framework in SQL Server 2005 tempdb becomes an even more important part of the platform than it used to be. Snapshot transaction isolation as well as database snapshots make big use of this "temporary" storage facility and keeping track of its usage is the key to the continuous efficiency of data management solutions based on the SQL Server platform.

As mentioned in my post on SI in SQL 2005 there are several dynamic management views that can be used to monitor tempdb activity and usage, but for a more comprehensive overview of the subject I recommend this fine article by Wei Xiao, Matt Hink, Mirek Sztajno and Sunil Agarwal.

IMHO a must-read when considering the implementation of Snapshot Isolation.

And speaking of tempdb – have you read this article by Tony Rogerson?

Update (November 14^th 2006)

On the SQL Server Best Practices site we can find another comprehensive study of the use of tempdb, specifically for the purposes of index creation and rebuilding:

TEMPDB Capacity Planning and Concurrency Considerations for Index Create and Rebuild by Sanjay Mishra (Technical reviewers: Tom Davidson, Mark Souza, Prem Mehra, Sunil Agarwal, Mike Ruthruff; Project Editor: Suzanne Bonney).

The SQL Server Best Practices Toolbox includes two methods useful in monitoring tempdb use:

List Real Time Tempdb Task Usage – a sample script that lists the real-time task usage of tempdb in SQL Server 2005; and
List Real-Time Tempdb Statements – a smple script that lists real-time statements that use tempdb in SQL Server 2005.

Thursday, September 21, 2006

Date[Time] constructor SQL-style

Querying temporal data in SQL Server has been discussed on many occasions, many tips and how-to's can be found online, best practices regarding date/time data that have evolved over time have been discussed in blogs, articles and books. As the eternal quest in developing practical, simple and efficient solutions regarding temporal data in SQL continues I've decided to chip in and show you a pretty simple example of how datetime values should be queried in SQL Server in order to maximize performance and at the same time how to bridge the gap between business and data store concepts of time.

In most information management solutions keeping and accessing temporal data is vital to the business needs – e.g. auditing, managing time-critical information, tracking the evolution of data, etc. Understanding the concept of time implemented in SQL Server is therefore highly important.

Most tips regarding the querying of date/time data revolve around one highly important aspect – performance. While SQL Server performs best with small data types (e.g. exact numerics, fixed-length character data types (up to a point), date/time values, etc.), it's usually the users who fail to take advantage of this fact. Too often do we see queries like this one; they bring performance down to a halt:

-- #1

use AdventureWorks
go

select *
 from Sales.SalesOrderHeader
 where (year(OrderDate) = 2001)
  and (month(OrderDate) = 7)
go

Fix it before it breaks you

First of all, the dreaded "all-columns-wildcard" has no place in production systems (with the possible exception of the EXISTS expressions, but that's another story). So, the obvious solution is to limit the number of columns in the result set to the ones that we actually need in a particular case – let's focus on the SalesOrderID, Status, CustomerID, SalesPersonID and the TotalDue columns for this example.

-- #2

select SalesOrderID
 ,Status
 ,CustomerID
 ,SalesPersonID
 ,TotalDue
 from Sales.SalesOrderHeader
 where (year(OrderDate) = 2001)
  and (month(OrderDate) = 7)
go

Now, let's optimize the execution of our query by indexing the table. I prefer covering indexes, don't you? ;)

create nonclustered index x_SalesOrderHeader_OrderDate
 on Sales.SalesOrderHeader
  (
  OrderDate
  ,SalesOrderID
  ,CustomerID
  ,SalesPersonID
  )
 include (
  Status
  ,TotalDue
  )
go

What's the result of this improvement? Let's execute query #2 again. 703 logical reads without the covering index vs. 139 with it. Nice, isn't it? It is, but we can do better. A lot better. We just need to replace what *seems* right with something that *is* right – functions on columns in restrictive conditions of a query may correspond to the business requirements perfectly, but they usually fail to correspond to the principles of efficient data management implemented in the data store (as we'll see later on).

What is "July 2001" in T-SQL?

Quite simply, it's the time between July 1^st 2001 and July 31^st 2001; both boundaries included:

-- #3

select SalesOrderID
 ,Status
 ,CustomerID
 ,SalesPersonID
 ,TotalDue
 from Sales.SalesOrderHeader
 where (OrderDate >= cast('2001-07-01' as datetime))
  and (OrderDate < dateadd(m, 1, cast('2001-07-01' as datetime)))
go

Due to the fact that SQL Server always stores date and time as a single value (look up datetime and smalldatetime in Books Online) the values above are actually implicitly converted at run time to 2001-07-01 00:00:00.000 and 2001-08-01 00:00:00.000 respectively. There are plenty of continuous discussions everywhere on how SQL Server should support separate data types for date and time values, but personally, I see little use in that. Anyway, a month in T-SQL is a time range beginning with the first millisecond of the first day of the month and ending before the first millisecond of the first day of the next month.

Back to our query: suddenly we're down to two logical reads. That's 2 logical reads to get 184 rows from a table containing 31465 rows.

The downside is that – purely for performance reasons – we'd have to propagate SQL-specific thinking up from the data layer: a month would no longer be just a simple value (or two) but a range of values (a set, if you like) – a concept that may just as well be completely irrelevant to the application (and/or the user). Could that be a problem?

The business requirement is simple: "show the data for a specific month", and as far as the solution as a whole is concerned the application logic behind what constitutes a month should not depend on the data layer's concept of a month – and vice versa. By translating the query conditions from the general definition corresponding to the business rule to a more specific definition corresponding to the data store rule we can prevent a conflict of the two concepts. IMHO this is one of the things stored procedures handle perfectly. Some elements of business logic should be implemented in stored procedures, but more importantly – data store-specific functionalities should be kept solely in stored procedures (outside of the application tier, by all means).

Which brings us to the point of this post – a SQL-style date/time constructor could prove useful here.

As is quite obvious from the examples above, the translation of the single business term "July 2001" to a pair of data store-specific terms is what we need – a transition from the logical definition of a temporal quantity to the physical definition of a date/time range.

We'll use two scalar functions for this purpose; they each consume up to three parameters (the values representing the year, the month and the day) and they return a lower and an upper boundary of the date/time range.

create function dbo.Date_LowerBoundary
 (
 @year smallint = null
 ,@month tinyint  = null
 ,@day tinyint  = null
 )
returns datetime
/*
The default value for @year is 1753 (datetime minimum).
*/
as
begin
 declare @minResult datetime
 
 select @minResult
   = case
    when @year between 1753 and 9999
     then cast(@year as char(4))
    else '1753'
    end
   + case
    when @month between 1 and 12
     then right('00' + cast(@month as varchar(2)), 2)
    else '01'
    end
   + case
    when (@month is not null)
     and (@day between 1 and 31)
     then right('00' + cast(@day as varchar(2)), 2)
    else '01'
    end
 
 return @minResult
end
go

create function dbo.Date_UpperBoundary
 (
 @year smallint = null
 ,@month tinyint  = null
 ,@day tinyint  = null
 )
returns datetime
/*
The default value for @year is 9999 (datetime maximum).
*/
as
begin
 declare @minResult datetime
 declare @maxResult datetime
 
 set @year = coalesce(@year, 9999)
 
 select @minResult
   = case
    when @year between 1753 and 9999
     then cast(@year as char(4))
    else '1753'
    end
   + case
    when @month between 1 and 12
     then right('00' + cast(@month as varchar(2)), 2)
    else '01'
    end
   + case
    when (@month is not null)
     and (@day between 1 and 31)
     then right('00' + cast(@day as varchar(2)), 2)
    else '01'
    end
 
 select @maxResult
   = case
    when @year = 9999
     then dateadd(yy, 1, dateadd(ms, -3, @minResult))
    when @month is null
     then dateadd(ms, -3, dateadd(yy, 1, @minResult))
    when @day is null
     then dateadd(ms, -3, dateadd(m, 1, @minResult))
    else dateadd(ms, -3, dateadd(dd, 1, @minResult))
    end
 
 return @maxResult
end
go

The functions provide a layer of abstraction between the business and the data store-specific concept of time. The translation is made in the data access layer; the application should remain oblivious to the concept-shift. The abstraction also allows for less determinism when supplying parameters: if all parameters are unknown (null) then the lowest allowed datetime value is used for the lower boundary and the highest allowed datetime value is used for the upper boundary; only the @year parameter can be specified, or @year and @month together, or all parameters, which gives the highest granularity of the entire SQL Server date/time range and the lowest of a single day.

Warning: the parameters are not validated; the functions assume that the input parameters originate from valid date/time values. A standard system error is raised if the transformed values don't evaluate to datetime.

Actions speak louder than words

Here are two examples of how the two date/time functions could be used.

The functions could be used in-line in the WHERE clause:

-- #4

declare @year smallint
declare @month tinyint
declare @day tinyint

set @year = 2001
set @month = 7

select SalesOrderID
 ,Status
 ,CustomerID
 ,SalesPersonID
 ,TotalDue
 from Sales.SalesOrderHeader
 where (OrderDate
  between dbo.Date_LowerBoundary(@year, @month, @day)
  and dbo.Date_UpperBoundary(@year, @month, @day))
go

The results of the functions could be assigned to variables prior to being used in the WHERE clause:

-- #5

declare @year  smallint
declare @month  tinyint
declare @day  tinyint
declare @lowerBoundary datetime
declare @upperBoundary datetime

set @year = 2001
set @month = 7
set @lowerBoundary = dbo.Date_LowerBoundary(@year, @month, @day)
set @upperBoundary = dbo.Date_UpperBoundary(@year, @month, @day)

select SalesOrderID
 ,Status
 ,CustomerID
 ,SalesPersonID
 ,TotalDue
 from Sales.SalesOrderHeader
 where (OrderDate between @lowerBoundary and @upperBoundary)
go

The upper boundary function returns the last valid datetime value of a given range, not the first valid datetime value of the next range, which means that it can be used with the BETWEEN operator. (And won't break when the year 9999 finally comes ;)

In case you're wondering, BETWEEN is inclusive, which makes

@myDate between @lowerDate and @upperDate

and

@myDate >= @lowerDate and @myDate <= @upperDate

functionally equivalent.

Abstraction costs

Yes, if conceptual simplicity is the goal one should count on a slight performance impact. This is clearly shown in the following datasheet. However, another more important fact is (or should be) far more obvious: using functions on columns in the WHERE clause has a devastating impact on query performance; the reason being that the optimizer can't use (otherwise appropriate) indexes if column values need to be transformed before the conditions can be evaluated.

Query	Query performance
	No additional indexes^*		Covering index^**		Datetime column index^***
	CPU	Reads	CPU	Reads	CPU	Reads
#1	34	703	14	751	24	715
40	703	20	751	20	882
40	703	10	751	20	673
30	703	20	751	31	673
30	703	10	751	20	673
30	703	10	751	30	673
#2	30	703	18	139	22	893
30	703	10	139	20	898
20	703	20	139	20	898
30	703	10	139	20	898
40	703	20	139	20	884
30	703	30	139	30	886
#3	24	703	0	2	8	644
20	703	0	2	10	766
30	703	0	2	10	614
30	703	0	2	0	614
20	703	0	2	10	614
20	703	0	2	10	614
#4	1730	707	4	7	1675	707
1773	707	0	7	1883	707
1632	707	0	7	1673	707
1722	707	10	7	1613	707
1712	707	0	7	1602	707
1813	707	10	7	1602	707
#5	28	707	2	7	44	707
31	707	0	7
40	707	0	7
10	707	10	7
30	707	0	7
30	707	0	7

The most performant option remains the one used in query #3 where the date/time range is prepared in advance. If 5 additional logical reads seem too much in a given situation, then option #3 is the way to go. On the other hand, the functions discussed in this post do provide a means that could at least prove worthy of consideration.

We've successfully reached the highest possible performance using a composite covering index with the datetime column at the principal position (statistics are kept for the first column of a composite index). Of course not all situations can be catered for with covering indexes – the possible impact on insert/update processes should always be considered. The two functions may still provide good results even without a covering index as long as they're not used in the WHERE clause directly; option #5 should be considered rather than option #4 which is apparently more dependent on the appropriate indexing.

^*By default only key columns are indexed in the Sales.SalesOrderHeader table: rowguid, SalesOrderNumber, CustomerID, SalesPersonID, SalesOrderID.

^**The covering index spans these columns: OrderDate, SalesOrderID, CustomerID and SalesPersonID, it includes the Status and TotalDue columns.

^***To demonstrate the actual benefit of the covering index I've included the results of a test using a nonclustered index on the OrderDate column:

create nonclustered index x_SalesOrderHeader_OrderDate
 on Sales.SalesOrderHeader
  (
  OrderDate
  )

Friday, September 08, 2006

Snapshot transaction isolation – optimistic concurrency made easy?

In a previous post we've looked at an example of how the principles of optimistic concurrency should be corrected in order to prevent possible harmful consequences that might come from uninhibited optimism. We've seen how to utilise a rowversion column to track data modifications at row level, and we've reached into the core of how to leverage data modification activities in a multi-tiered solution. No sophisticated locking and detection mechanisms were used in achieving our goal (and no principles of information management were violated in the process). We've also mentioned the fact that in SQL Server 2005 there is an alternative that further simplifies the implementation of the corrected principles of optimistic concurrency.

The subject of this post is how snapshot isolation (SI) introduced in SQL 2005 can be used to implement optimistic concurrency in data management.

SI in SQL Server 2005 is implemented at database level, it has two variations, each with its own characteristics. The foundations of SI lie in the row versioning framework which is an integral part of the database engine. More on the subject is, of course, available in Books Online.

There are two new database settings with the purpose of controlling SI:

ALLOW_SNAPSHOT_ISOLATION – by default the setting is off, and needs to be turned on before SI can be used. The setting is transaction-scoped: the snapshot is created using data that was committed before the snapshot transaction started; and
READ_COMMITTED_SNAPSHOT – also off by default; it allows the snapshots to be used with the (default) read committed isolation level – combining the essential characteristics of snapshot and read committed isolation levels. This setting is statement-scoped: the snapshot is created using data that was committed before each statement started.

Both settings can be modified using the ALTER DATABASE DDL statement. Which brings us to the first consideration – enabling and disabling SI. In accordance with the new SQL 2005 approach to database-level and server-level changes aimed at achieving as much as possible without the need to interrupt data access these two settings can be changed on a live system.

When enabling the READ_COMMITTED_SNAPSHOT setting there must be no other open connections – restricting database use (i.e. single-user mode) is not required, the operation simply waits for all other open connections to close. Which could be really long if for example there are many SSMS connections open at the time. ;)

When enabling the ALLOW_SNAPSHOT_ISOLATION setting special rules apply in order not to block users (follow the link above). The process waits for pending transactions to complete.

SI uses tempdb to store row versions, so make sure tempdb is large enough and maintained regularly.

The situation

Let's imagine two conflicting transactions – two different users attempting to modify the same row of data at the same time. If by any chance you have trouble imagining that, read this post and all will be revealed. ;)

Bottom line: by utilising rowversion columns and adapting data modification logic accordingly conflicting modifications can be prevented – the user is warned of potential conflicts. This should be handled on the database level.

Optimistic concurrency with the snapshot transaction isolation level

Snapshot transaction isolation means that each user who started a snapshot transaction has uninhibited read-access to the last committed version of the row. In snapshot transactions readers are not blocked by writers. Nonetheless, conflicting modifications are not possible – the engine itself prevents it!

The row versioning framework checks whether the row being updated has already been updated in a concurrent transaction after the snapshot transaction was started.

If this is true then the snapshot transaction will be terminated and an exception will occur:

Msg 3960, Level 16, State 5, Line 8
Snapshot isolation transaction aborted due to update conflict. You cannot use
snapshot isolation to access table '%table name%' directly or indirectly in
database '%database name%' to update, delete, or insert the row that has been
modified or deleted by another transaction. Retry the transaction or change
the isolation level for the update/delete statement.

What does this mean? Well, when using snapshot transaction isolation the database engine will automatically track data modification conflicts, prevent them and warn us when they occur. In other words – thanks to the row version framework the use of an extra rowversion column in tables and the necessary logic to support it is no longer needed, that is if the following conditions are met:

Database compatibility level is 90 (or higher) – obviously;
The ALLOW_SNAPSHOT_ISOLATION database setting is ON; and
The transaction is started with the snapshot isolation level.

There are a few limitiations, though:

Distributed transactions using SI are not supported;
DDL operations against database objects may block SI transactions;
Changing certain database setting will cause pending snapshot transactions to fail (e.g. setting the database to read-only, changing settings that lead to database recovery, etc.);
Switching to the snapshot isolation level from any other explicitly set isolation level will cause an exception followed by an immediate rollback of all transactions.

Any other considerations? Cross-database transactions with the snapshot isolation level require all databases involved to have the ALLOW_SNAPSHOT_ISOLATION database setting turned on, unless the limitation is circumvened by using table hints. Attempts to violate this limitation will cause an exception:

Msg 3952, Level 16, State 1, Line 3
Snapshot isolation transaction failed accessing database '%database name%'
because snapshot isolation is not allowed in this database. Use ALTER
DATABASE to allow snapshot isolation.

To circumvene the SI limitation an appropriate table hint can be used when referencing tables in a database where SI is not allowed; e.g.:

set transaction isolation level snapshot

select <column list>
 from <database with SI ON>.<schema>.<table> SI_ON
  inner join <database with SI OFF>.<schema>.<table> SI_OFF with(readcommitted)
    on SI_OFF.<common key> = SI_ON.<common key>

A warning from Books Online:

Because the SQL Server 2005 query optimizer typically selects the best execution plan for a query, we recommend that hints, including <table_hint>, be used only as a last resort by experienced developers and database administrators.

Do SI limitations constitute such a case?

All things considered, using snapshot transaction isolation is a good way to implement optimistic concurrency in SQL 2005, as long as the limitations are taken into account and provided for. It's the simplest alternative, but not the only one.

Optimistic concurrency with the read committed transaction isolation level and snapshots

The key feature of SI is in the way data is read: all committed rows are available to any connection – readers are no longer blocked by writers. In business terms: data access (e.g. for reporting purposes) is instant and as accurate as possible (especially when considering the overly popular and just as questionable alternative: the nolock table hint). IMHO not blocking readers constitutes the principal benefit of the READ_COMMITTED_SNAPSHOT database setting.

Why? Again, there are a few limitations to consider, but the most critical one is the fact that data modification conflicts are not detected automatically by the database engine. This means that even with snapshots the read committed isolation level in SQL 2005 still requires additional logic in data management to properly support optimistic concurrency (e.g. rowversion & Co.) – however, with the prevailing benefit of readers not being blocked by writers.

There's another benefit: read committed transactions are not blocked by any DDL operations against database objects when READ_COMMITTED_SNAPSHOT is set. For instance: rebuilding indexes will cause snapshot transactions to fail, while read committed transactions will not be interrupted.

So, using the rowversion column when implementing optimistic concurrency remains a valid option in SQL Server 2005; even more so in situations where the limitations of the snapshot transaction isolation level rear their ugly head. However, if these aren't relevant to the case in hand then snapshot isolation prevails by being the least difficult of the two to implement.

Mental aggregation

Common facts
– read operations are not blocked by write operations;
– row versions are stored in tempdb.

	Conflict detection	Limitations	Snapshot taken
`ALLOW_SNAPSHOT_ISOLATION`	automatic (the transaction is aborted)	– DDL operations are fatal; – database recovery is fatal; – impossible to switch transaction isolation to snapshot from any other level; – cross-database queries require table hints if the setting is not turned on in all databases involved; – conflict detection cannot be disabled;	before transaction starts
`READ_COMMITTED_SNAPSHOT`	no	cannot be set for tempdb, msdb or master;	before each statement

Keeping an eye on things

How to check whether snapshot transaction isolation is allowed:

select sys.databases.snapshot_isolation_state
 ,sys.databases.snapshot_isolation_state_desc
 from sys.databases
 where (sys.databases.[name] = '<MyDatabase>')

How to check whether read committed transactions use snapshots:

select sys.databases.is_read_committed_snapshot_on
 from sys.databases
 where (sys.databases.[name] = '<MyDatabase>')

How to monitor resources related to snapshot isolation and row versioning?

Ten dynamic management views are available to help you monitor the activities of the row versioning framework in SQL 2005. Start here.

Friday, September 01, 2006

Optimistic concurrency (with essential corrections)

Optimistic concurrency rests on somewhat wobbly foundations – the presumption that no user will ever be blocked by another user. Which is usually true for single-user solutions while multi-user solutions require a slight correction to this logic. Put another way: with optimistic concurrency we assume that a particular row will only be modified by one user at any given time, and being fully optimistic we do nothing to prevent different users from trying to modify the same row.

(I must admit that the subject of this post has been discussed pretty much everywhere up to the point of almost being disgusting, but I've had this sample laying around for a while and now I've decided on moving it somewhere else rather than permanently removing it from my disk.)

Of course not doing anything to prevent what seems quite probable is not optimism – it's stupidity. On the other hand doing too much (like introducing a sophisticated row-locking mechanism) is also hardly a good idea, especially since elementary cuncurrency management mechanisms already are part of the platform. Basically, what the client application really needs to know before attempting a data modification is:

Whether a row already exists (when inserting data); and
Whether the data has been modified between the time the local set of data has been retrieved and the time when an attempt is made to commit the changes (when updating and/or deleting data).

Inserting duplicate data is prevented through the implementation of the natural key – a combination of one or more attributes that uniquely identify each entity. If the client application also implements the natural key, then an insert will not even be attempted.

The update is where pure optimism can lead to problems. And the same goes for delete operations. I believe an illustration might help here:

Jim and Bob both work for a car manufacturer. Their job is to manage the bill of materials – keeping a current inventory list of all parts for a particular car model this manufacturer produces.

Today their task is simple: the car's breaking system has been changed slightly by the engineers and as a consequence the bolt with the ID number of B12/06 has been replaced, so now the inventory needs to be modified to reflect the change.

Jim and Bob work well together – they are a true team. However, they have a lot of work to do and sometimes don't have time to coordinate their activities perfectly, and pretty much the same is true for Tom, their boss. Which is not critical as long as they are able to identify any flaws in the organisation of their work and correct them accordingly. (The fact of the matter is that most human activities in general do not have to run perfectly smoothly all the time for the job to get done correctly and in time. Long story short – look up these words in your favorite dictionary: "politics", "strategy" and "tactics".)

Back to our example. Bob has been in the company longer than Jim and he's got a bit more experience, further more – his cousin Hal is an engineer. Bob decides on going straight to the source to find out exactly what changes have been made to the breaking system so that he can modify the inventory list appropriately.

Roughly at the same time Jim passes Tom in the hallway and Tom reminds him of the changes to the car's breaking system and Jim decides to go straight up to the chief engineer's office to get the details. He knows how important breaks are for the safety of the car and – more importantly – the passengers.

Anyway, both Jim and Bob get the information they need, and without missing a beat proceed to fill in the changes. Jim goes back to his office, while Bob simply takes out his PDA and logs on. Jim had the chance to look over the blue prints and he knows the new bolt is a 5x20, Bob talked to Hal and Hal showed him the new bolts so Bob knows the new bolt is a 5x25.

At this very moment both Jim and Bob see the details of the old breaking system – the old bolt is a 3x20, obviously completely wrong for the new breaks, and the last time this value has been changed corresponds to the time before the breaking system has been changed.

Now, in a true contemporary-Hollywood-sci-fi manner time slows down to a halt. This gives us time for a few questions:

Should the longer bolt be used or the shorter one?
Should Jim wait for Bob?
Should Bob go back to his office and talk to Jim?
Should Tom be more strict and possibly supervise his subordinates constantly?
Should Hal be forbidden to talk to the members of the inventory department?
Should the chief engineer never leave his office?

All valid questions. And all completely irrelevant.

The real question is:

Who should be the one to notice the conflicting values – Jim, Bob, Tom, Hal or the owner of a brand new car driving down a steep, slippery hill desperately trying to keep the car on the road?

With optimistic concurrency both Jim and Bob are free to modify the data. The platform itself will keep one of the conflicting transactions on hold (locked out) until the ongoing transaction is completed (either committed or rolled back) if by chance both Jim and Bob attempt to modify the row at the same time. Either way – only one value will eventually exist in the database, and both Jim and Bob will be confident in thinking the correct value was saved and the inventory list is in perfect order. That is, if we blindly follow the wobbly principle of optimistic concurrency mentioned above.

Considering that time stands still we even have time for a rhetorical question: could such obvious conflicts be prevented by an application? Maybe even a philosophical one: can computers think?

Let's focus on the pure mechanics of the situation:

Jim wants to replace the value of 3x20 with the new value of 5x20.
Bob wants to replace the value of 3x20 with the new value of 5x25.

This means that whoever of them commits the modification first will cause the other to be unable to modify the value in accordance with the logical rule above – changing 5x20 to 5x25 or 5x25 to 5x20 was never intended and should not occur. Can it be prevented?

Jim and Bob are intelligent beings: they've both identified the value of 3x20 to be the old one that needs to be replaced; even if all they've actually done is look at the data available in the inventory. They both have the *before* picture, and can imagine the correct outcome of the modification (based on the information they have). The trouble is that – since the server cannot consume both modifications at the same time – one of the guys inevitably ends up looking at the wrong *before* picture: the first successfully committed modification will have changed it. In addition – neither Jim nor Bob are even aware of the other's attempt to modify data. Should they be? Not with optimistic concurrency – attempts are irrelevant.

One possible solution springs to mind – the application should detect the change! But could it? It could detect the change after it has been committed and only after an appropriate mechanism has subsequently notified the application of this change (e.g. by utilizing the SqlDependency Class introduced in .Net 2.0), the attempt itself could not be detected. This, of course, may just as well be too late, because the conflict of two (or more) users does not actually occur at the application tier – it occurs deeper down in this architectural wonder we call a multi-tiered system. Can the discrepancy be detected at the data layer? Well, yes, it can. With a little help from an internal attribute that designates a point in time (I'm using the term "time" here as a reference to an imminent natural occurrence we can all observe but not influence).

A very popular method is to use a timestamp column.

Even after all these years there still seems to be quite a lot of confusion among users regarding the timestamp data type and its implementation in SQL Server.

First of all, the timestamp data type in SQL Server has nothing to do with storing date or time (contrary to the SQL Standard specification of timestamp):

timestamp <> datetime

Second, the behaviour of timestamp is not guaranteed in future releases and using the rowversion data type synonym in data definition language (DDL) statements is suggested. Personally, I even prefer the name – it suits the actual functionality.

How does rowversion work and how does it apply to our example?

The value of the rowversion column changes automatically every time one or more columns are updated. Of course the user does not have to be aware of this column, it's best never to propagate the rowversion values to the user interface at all – they mean nothing to the user. Keeping track of the user who last changed the row and of the time the row was last changed is usually added to provide more user-friendly information on data changes. In the data layer these two attributes could either be managed via a trigger or in the data modification procedures. They serve the purpose of making data modifications more transparent to the user.

In our example the implementation of the rowversion column could be used to prevent concurrency issues and to warn the user of the fact that the value has been changed in the time between the moment of data retrieval and the moment of the attempted modification.

Let's now in true sci-fi fashion unpause the universe and let Bob change the dimensions of bolt number B12/06. Just as he clicks Save on his PDA and successfully commits the transaction, back in the office Jim decides to do the same. But to his surprise the server rejects Jim's update and the application responds with a "concurrency violation" warning. He decides on refreshing the data and immediately notices that the value has been changed by Bob quite recently. He also notices a discrepancy between the new value and the value he'd expected to see, which – if nothing else – serves as a pretty solid foundation for what we humans do best – solve problems.

How was this achieved? Well, let's repeat the two transactions.

After creating our playground using this script open a new connection in your favourite SQL querying tool.

/* Connection 1

Bob's task:

Change B12/06 from 3x20 to 5x25

*/

-- Let's get the current data for Bob:

select dbo.Bolts.BoltNumber
 ,dbo.Bolts.BoltType
 ,dbo.Bolts.LastChangedBy
 ,dbo.Bolts.LastChangedOn
 ,dbo.Bolts.RV
 from dbo.Bolts
 where (dbo.Bolts.BoltNumber = 'B12/06')

-- Copy the RV value. We'll need it later.

In a new connection run the following:

/* Connection 2

Jim's task:

Change B12/06 from 3x20 to 5x20

*/

-- Let's get the current data for Jim:
-- Yes, Jim sees what Bob sees.

select dbo.Bolts.BoltNumber
 ,dbo.Bolts.BoltType
 ,dbo.Bolts.LastChangedBy
 ,dbo.Bolts.LastChangedOn
 ,dbo.Bolts.RV
 from dbo.Bolts
 where (dbo.Bolts.BoltNumber = 'B12/06')

-- You've remembered the RV value, right?
-- We don't need it yet. ;)

Return to connection 1 and let Bob update the row. Do not commit the transaction yet – we'll take the opportunity to check something later on.

/* Connection 1 */
declare @rv binary(8)

set @rv = --paste the RV value here

begin tran

update dbo.Bolts
 set dbo.Bolts.BoltType = '5x25'
  ,dbo.Bolts.LastChangedBy = 'Bob'
  ,dbo.Bolts.LastChangedOn = getdate()
 where (dbo.Bolts.BoltNumber = 'B12/06')
  and (dbo.Bolts.RV = @rv)

if (@@rowcount > 0)
 begin
  print 'No failure'
 end
else
 begin
  print 'Concurrency violation'
 end
go

As expected Bob successfully changes the column.

(1 row(s) affected)
No failure

Let's open a third connection and check a few things.

/* Connection 3*/

There's no point in running this query as it will run indefinitely, because Bob's transaction is still pending:

select *
 from dbo.Bolts

Of course in a real-life situation the transaction would have been committed by now.

Note that the query is not blocked in a SQL 2005 database with the READ_COMMITTED_SHAPSHOT setting turned ON. But that's another story.

The next query is popular with many users that are new to SQL, however, it's only ever useful for reporting purposes – when the report needs to be done quickly rather than reflect actual values:

select *
 from dbo.Bolts with(nolock)

The next query can be used to retrieve rows that are not currently locked by a transaction, which is useful for retrieval operations of individual (unlocked) rows while its use for data aggregation will most probably still yield useless results. But sometimes promptness supersedes accuracy:

select *
 from dbo.Bolts with(readpast)

All table hints are explained here.

Now this next query would certainly satisfy several newsgroup posters asking for a way to navigate the restless sea of locks. It offers a view into the logic behind locks, but that's pretty much all it's good for:

select NoLockBolts.*
 from dbo.Bolts NoLockBolts with(nolock)
 where (not exists (
   select *
    from dbo.Bolts ReadPastBolts with(readpast)
    where (NoLockBolts.BoltNumber = ReadPastBolts.BoltNumber)
   ))

The query returns a list of locked rows. Looks useful, but isn't really.

In SQL Server 2005 (if the database allows snapshot isolation) Jim could be allowed to access data in a snapshot – uninhibitted by Bob's transactions:

-- Set snapshot isolation level
set transaction isolation level snapshot

select *
 from dbo.Bolts

-- Set default isolation level
set transaction isolation level read committed

But that's not what this post is about. We'll have to discuss the implications of snapshot transaction isolation another time.

Let's allow Bob to end his transaction.

/* Connection 1 */
commit tran

If you like you can return to connection 3 and try the queries again.

Now let's give Jim at connection 2 a chance to do his job (or at least try).

/* Connection 2 */
declare @rv binary(8)

set @rv = --paste the RV value here (not the new one - the old one!)

begin tran

update dbo.Bolts
 set dbo.Bolts.BoltType = '5x20'
  ,dbo.Bolts.LastChangedBy = 'Jim'
  ,dbo.Bolts.LastChangedOn = getdate()
 where (dbo.Bolts.BoltNumber = 'B12/06')
  and (dbo.Bolts.RV = @rv)

if (@@rowcount > 0)
 begin
  print 'No failure'
 end
else
 begin
  print 'Concurrency violation'
  -- Now Jim knows that the data has been changed.
 end

What happens at the car manufacturer next would probably be quite interesting to see, but I'll leave that to your imagination. As far as the information management part is concerned by utilising the rowversion column we have not allowed optimism obscure data discrepancies and potentially lead to undesired consequences, but rather to guide data modifications by keeping a logical reference to the state a particular row of data was in prior to the modification attempt.

The rowversion value in this example helped us enforce the two rules mentioned above; based on the information available to both users before attempting modifications the logic was simple: the initial state of the row was fixed in time (logically, not physically) and even though there was a discrepancy between the expected results, both users had to have been treated equally in their potential to actually modify the value (which is the central principle of optimistic concurrency). The server had to follow a very elementary rule: "change A to B", and after A no longer existed the server had to prevent the violation of the rule. Neither the application nor the server "knew" the logical rule in detail (e.g. the initial column values were never compared at the server, only the rowversion value was used as a reference to a point in time in which the user's decision to change the value originated); the computer didn't think – the thinking was left to humans.

What was at work here?

Let's take another look at the principal feature of optimistic concurrency: any user can change any row, since we presume that only a single user at any given time will ever attempt to modify a row.

The above principle needs to be corrected if we consider the probability that the above presumption is violated. And we *should* consider it, especially in a multi-user environment. The correction is: any user can change any row as long as the timeline has not been broken. In other words: any user can modify a row of data if no other user has changed it between the time the first user has retrieved data and the time he attempted the modification.

Since this no longer constitutes *pure* optimism could we then call it *optimal* optimism from now on?

I've mentioned SqlDependency (a .Net 2.0 class of the System.Data.SqlClient namespace) as a way for the application to track changes to the data layer. This functionality may reduce the occurence of concurrency issues even further, but it cannot prevent two or more users attempting to update the same row of data at the same time – conflicts like these must be resolved at the data layer. Why? The situation itself is at its most critical stage at the very moment cuncurrent user actions collide, and this happens on the server, not in the application, so why would the application have to handle them? Nevertheless, if the application has the possibility to reduce the number of (potential) collisions then it should do so, by all means. Well, as long as the corrected principle of optimistic concurrency remains in effect.