snaps & snippets: EXISTS or NOT EXISTS... that is the question here

Whether 'tis nobler to count the rows in tables
And stall the traffic with outrageous pride
Or to merely stop as early as a single row
Fullfills the expectations; To exist: to be found.

Has Hamlet ever designed a data driven solution?

How to check whether one or more rows corresponding to a given set of conditions exist? It should be common knowledge by now that this

if ((
 select count(*)
  from ...
  where ...
 ) > 0)
 begin
  -- logic if rows exist
 end
else
 begin
  -- logic if rows don't exist
 end

looks like the answer, but is not, or at least *might* not be. Nonetheless, methods like this one still manage to creep into production code now and then. Is that all bad? Keep reading – you may be surprised.

Anyway, in T-SQL there's a far more efficient method of checking whether rows exist, and we'll take a look at how efficient it actually is. And perhaps learn something new. Start by creating this table in your favourite SQL 2000 (yes, that's 2K) "testing-ground" database (the script assumes that [SQL2005] is the name of your SQL 2005 server where the AdventureWorks database resides). The only index included in the script is a clustered primary key on the identity column.

`EXISTS`

The expression is described in Books Online, along with some typical examples of use. The example above can thus be rewritten as:

if (exists (
  select *
   from ...
   where ...
  ))
 begin
  -- logic if rows exist
 end
else
 begin
  -- logic if rows don't exist
 end

What's the benefit? First of all, the query is now semantically correct: instead of counting the rows to test whether there is at least one that corresponds to the search criteria, we simply query the database engine for their existence. The more important aspect is the performance gain: queries using the EXISTS expression outperform those using the "if-count-greater-than-zero" alternative. Well, they do in SQL 2000.

The example queries are available at the bottom of the post – follow the links...

Query	SQL 2000 Performance
	Clustered Primary Key
	CPU	Reads
#1	15	118
#2	0	7

118 reads vs. 7 (seven!) against a table with 31465 rows in total (30086 of those correspond to the criteria). The EXISTS expression performs approximately 16 times faster than the one based on the row count.

How does SQL Server process an EXISTS expression? To put it as simply as possible: data processing stops as soon as the condition is met, and only as much data is actually read as is needed to test whether the condition is satisfied by *any* row since no rows need to be returned to the client. This means that if an appropriate index exists, the database engine (theoretically) only needs to access a single index page.

Query	SQL 2000 Performance
	Covering Index
	CPU	Reads
#1	15	92
#2	0	2

With a covering index we're down to 2 reads for the EXISTS expression. The performance of the "if COUNT > 0" method is also slightly improved by the covering index as all processing can be done on index pages.

* or 1 or what?

There are three methods of forming the EXISTS query that can be observed in practice:

exists (select * ...)

exists (select 1 ...)

exists (select %table_primary_key% ...)

Each of the above examples comes with a set of at least three users vigorously defending theirs as the most correct of all. The fact remains that the SELECT statement of the EXISTS expression does not affect the execution plan at all. IMHO, option #3 is the least sensible but still there's only one reason why it shouldn't be used: the query would cause errors if the table's primary key column(s) were renamed.

According to an urban myth option #1 is supposed to be the only correct one as "it allows the optimizer to choose the most appropriate index". This is perfecly unfounded as all three examples produce the same execution plan. At least in Microsoft SQL Server. => Busted!

With table hints the optimizer can be "guided" (=forced) into using a specific index. However, the use of index hints should be revised when (perhaps more appropriate) indexes are added or when the schema changes or even when the table usage changes.

Query	SQL 2000 Performance
	Clustered Index Scan			Nonclustered Index Seek ^{(Covering Index)}
	Reads	Estimated Row Size ^(Bytes)	Estimated I/O Cost	Reads	Estimated Row Size ^(Bytes)	Estimated I/O Cost
#3	7	45	0,0032066	2	45	0,0032041
#4	8	45	0,0032066	2	45	0,0032041

Query	SQL 2005 Performance
	Clustered Index Scan			Nonclustered Index Seek ^{(Covering Index)}
	Reads	Estimated Row Size ^(Bytes)	Estimated I/O Cost	Reads	Estimated Row Size ^(Bytes)	Estimated I/O Cost
#3	7	15	0,086088	2	9	0,0668287
#4	8	23	0,086088	2	15	0,0668287

It's safe to assume that an appropriate nonclustered index will be the optimizer's preferred choice. In fact, covering indexes once again take the gold.

But now, here's a bit of a surprise...

SQL Server 2005 knows about "if-count-greater-than-zero"

In SQL 2005 another performance improvement becomes apparent, sadly only when querying with the "COUNT > 0" or the "COUNT >= 1" condition.

Query	SQL 2005 Performance
	Clustered Primary Key		Covering Index
	CPU	Reads	CPU	Reads
#1	0	7	0	2
#2	0	7	0	2
#5	0	7	0	2
#6	20	115	15	89

What?! Has counting rows been improved this much in SQL 2005? Nope. Query #6 tests whether there are more rows than 1 and the behaviour is back to expected. Let's compare execution plans:

The expected behaviour (COUNT in SQL 2000);

(using the clustered primary key):

|--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1002]>0) then 1 else 0))
     |--Nested Loops(Inner Join)
          |--Constant Scan
          |--Compute Scalar(DEFINE:([Expr1002]=Convert([Expr1013])))
               |--Stream Aggregate(DEFINE:([Expr1013]=Count(*)))
                    |--Filter(WHERE:([SalesData].[OrderDate]<=getdate()))
                         |--Clustered Index Scan(OBJECT:(
                            [TestingGround].[dbo].[SalesData].[pk_SalesData_SalesDataId]),
                            WHERE:([SalesData].[OrderDate]>='Jan  1 2002 12:00AM'))

or (using the nonclustered covering index):

|--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1002]>=1) then 1 else 0))
     |--Nested Loops(Inner Join)
          |--Constant Scan
          |--Compute Scalar(DEFINE:([Expr1002]=Convert([Expr1013])))
               |--Stream Aggregate(DEFINE:([Expr1013]=Count(*)))
                    |--Index Seek(OBJECT:(
                       [TestingGround].[dbo].[SalesData].[x_SalesData_OrderDate_TotalDue]),
                       SEEK:([SalesData].[OrderDate] >= 'Jan  1 2002 12:00AM'
                       AND [SalesData].[OrderDate] <= getdate()) ORDERED FORWARD)

The surprising improvement (COUNT > 0, COUNT >= 1 and EXISTS in SQL 2005);

(using the clustered primary key):

|--Compute Scalar(DEFINE:([Expr1005]=CASE WHEN [Expr1006] THEN (1) ELSE (0) END))
     |--Nested Loops(Left Semi Join, DEFINE:([Expr1006] = [PROBE VALUE]))
          |--Constant Scan
          |--Filter(WHERE:([TestingGround].[dbo].[SalesData].[OrderDate]<=getdate()))
               |--Clustered Index Scan(OBJECT:(
                  [TestingGround].[dbo].[SalesData].[pk_SalesData_SalesDataId]),
                  WHERE:([TestingGround].[dbo].[SalesData].[OrderDate]
                  >='2002-01-01 00:00:00.000'))

or (using the nonclustered covering index):

|--Compute Scalar(DEFINE:([Expr1005]=CASE WHEN [Expr1006] THEN (1) ELSE (0) END))
     |--Nested Loops(Left Semi Join, DEFINE:([Expr1006] = [PROBE VALUE]))
          |--Constant Scan
          |--Index Seek(OBJECT:(
             [TestingGround].[dbo].[SalesData].[x_SalesData_OrderDate_TotalDue]),
             SEEK:([TestingGround].[dbo].[SalesData].[OrderDate] >= '2002-01-01 00:00:00.000'
             AND [TestingGround].[dbo].[SalesData].[OrderDate] <= getdate()) ORDERED

Notice the Stream Aggregate operator in the first case and its mysterious absence from the second?

Obviously, SQL Server 2005 "knows" what "if-count-greater-than-zero" and "if-count-at-least-one" mean – the execution plans for "COUNT > 0", "COUNT >= 1" and EXISTS are the same. Further more, setting the compatibility level to 80 or lower does not stop the new behaviour. Is this the result of an intended improvement or merely a side-effect of some other improvement?

The world keeps on changing

There certainly is strength in the EXISTS expression, and starting with SQL 2005 there appears to be strength in methods previously thought of as weak. At least this is true of the two presented here.

P.S. If you're looking for the exist XML method, you can find it right here.

^* Queries used in the demonstration:

Query #1:

if ((
 select count(*)
  from dbo.SalesData
  where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
 ) > 0)
 begin
  print 1
 end
else
 begin
  print 0
 end

Query #2:

if (exists (
  select *
   from dbo.SalesData
   where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
  ))
 begin
  print 1
 end
else
 begin
  print 0
 end

Covering index DDL script:

create nonclustered index x_SalesData_OrderDate_TotalDue
 on dbo.SalesData
  (
  OrderDate
  ,TotalDue
  )

Query #3:

-- Forcing the use of the clustered primary key:
if (exists (
  select *
   from dbo.SalesData with(index(pk_SalesData_SalesDataId))
   where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
  ))
 begin
  print 1
 end
else
 begin
  print 0
 end

-- Forcing the use of the covering nonclustered index:
if (exists (
  select *
   from dbo.SalesData with(index(x_SalesData_OrderDate_TotalDue))
   where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
  ))
 begin
  print 1
 end
else
 begin
  print 0
 end

Query #4:

-- Forcing the use of the clustered primary key:
if (exists (
  select *
   from dbo.SalesData with(index(pk_SalesData_SalesDataId))
   where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
    and (dbo.SalesData.TotalDue = 3953.9884)
  ))
 begin
  print 1
 end
else
 begin
  print 0
 end

-- Forcing the use of the covering nonclustered index:
if (exists (
  select *
   from dbo.SalesData with(index(x_SalesData_OrderDate_TotalDue))
   where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
    and (dbo.SalesData.TotalDue = 3953.9884)
  ))
 begin
  print 1
 end
else
 begin
  print 0
 end

Query #5:

if ((
 select count(*)
  from dbo.SalesData
  where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
 ) >= 1)
 begin
  print 1
 end
else
 begin
  print 0
 end

Query #6:

if ((
 select count(*)
  from dbo.SalesData
  where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
 ) > 1)
 begin
  print 1
 end
else
 begin
  print 0
 end

snaps & snippets

Tuesday, October 03, 2006

EXISTS or NOT EXISTS... that is the question here

`EXISTS`

* or 1 or what?

SQL Server 2005 knows about "if-count-greater-than-zero"

The world keeps on changing

No comments: