Tuesday, October 03, 2006

EXISTS or NOT EXISTS... that is the question here

Whether 'tis nobler to count the rows in tables
And stall the traffic with outrageous pride
Or to merely stop as early as a single row
Fullfills the expectations; To exist: to be found.

Has Hamlet ever designed a data driven solution?

How to check whether one or more rows corresponding to a given set of conditions exist? It should be common knowledge by now that this

if ((
 select count(*)
  from ...
  where ...
 ) > 0)
 begin
  -- logic if rows exist
 end
else
 begin
  -- logic if rows don't exist
 end

looks like the answer, but is not, or at least *might* not be. Nonetheless, methods like this one still manage to creep into production code now and then. Is that all bad? Keep reading – you may be surprised.

Anyway, in T-SQL there's a far more efficient method of checking whether rows exist, and we'll take a look at how efficient it actually is. And perhaps learn something new. Start by creating this table in your favourite SQL 2000 (yes, that's 2K) "testing-ground" database (the script assumes that [SQL2005] is the name of your SQL 2005 server where the AdventureWorks database resides). The only index included in the script is a clustered primary key on the identity column.


EXISTS

The expression is described in Books Online, along with some typical examples of use. The example above can thus be rewritten as:

if (exists (
  select *
   from ...
   where ...
  ))
 begin
  -- logic if rows exist
 end
else
 begin
  -- logic if rows don't exist
 end

What's the benefit? First of all, the query is now semantically correct: instead of counting the rows to test whether there is at least one that corresponds to the search criteria, we simply query the database engine for their existence. The more important aspect is the performance gain: queries using the EXISTS expression outperform those using the "if-count-greater-than-zero" alternative. Well, they do in SQL 2000.

The example queries are available at the bottom of the post – follow the links...

Query SQL 2000 Performance
Clustered Primary Key
CPU Reads
#1 15 118
#2 0 7

118 reads vs. 7 (seven!) against a table with 31465 rows in total (30086 of those correspond to the criteria). The EXISTS expression performs approximately 16 times faster than the one based on the row count.

How does SQL Server process an EXISTS expression? To put it as simply as possible: data processing stops as soon as the condition is met, and only as much data is actually read as is needed to test whether the condition is satisfied by *any* row since no rows need to be returned to the client. This means that if an appropriate index exists, the database engine (theoretically) only needs to access a single index page.

Query SQL 2000 Performance
Covering Index
CPU Reads
#1 15 92
#2 0 2

With a covering index we're down to 2 reads for the EXISTS expression. The performance of the "if COUNT > 0" method is also slightly improved by the covering index as all processing can be done on index pages.


* or 1 or what?

There are three methods of forming the EXISTS query that can be observed in practice:

exists (select * ...)
exists (select 1 ...)
exists (select %table_primary_key% ...)

Each of the above examples comes with a set of at least three users vigorously defending theirs as the most correct of all. The fact remains that the SELECT statement of the EXISTS expression does not affect the execution plan at all. IMHO, option #3 is the least sensible but still there's only one reason why it shouldn't be used: the query would cause errors if the table's primary key column(s) were renamed.

According to an urban myth option #1 is supposed to be the only correct one as "it allows the optimizer to choose the most appropriate index". This is perfecly unfounded as all three examples produce the same execution plan. At least in Microsoft SQL Server. => Busted!

With table hints the optimizer can be "guided" (=forced) into using a specific index. However, the use of index hints should be revised when (perhaps more appropriate) indexes are added or when the schema changes or even when the table usage changes.

Query SQL 2000 Performance
Clustered Index Scan Nonclustered Index Seek
(Covering Index)
Reads Estimated Row Size
(Bytes)
Estimated I/O Cost Reads Estimated Row Size
(Bytes)
Estimated I/O Cost
#3 7 45 0,0032066 2 45 0,0032041
#4 8 45 0,0032066 2 45 0,0032041

Query SQL 2005 Performance
Clustered Index Scan Nonclustered Index Seek
(Covering Index)
Reads Estimated Row Size
(Bytes)
Estimated I/O Cost Reads Estimated Row Size
(Bytes)
Estimated I/O Cost
#3 7 15 0,086088 2 9 0,0668287
#4 8 23 0,086088 2 15 0,0668287

It's safe to assume that an appropriate nonclustered index will be the optimizer's preferred choice. In fact, covering indexes once again take the gold.

But now, here's a bit of a surprise...


SQL Server 2005 knows about "if-count-greater-than-zero"

In SQL 2005 another performance improvement becomes apparent, sadly only when querying with the "COUNT > 0" or the "COUNT >= 1" condition.

Query SQL 2005 Performance
Clustered Primary Key Covering Index
CPU Reads CPU Reads
#1 0 7 0 2
#2 0 7 0 2
#5 0 7 0 2
#6 20 115 15 89

What?! Has counting rows been improved this much in SQL 2005? Nope. Query #6 tests whether there are more rows than 1 and the behaviour is back to expected. Let's compare execution plans:

  • The expected behaviour (COUNT in SQL 2000);

    (using the clustered primary key):
    |--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1002]>0) then 1 else 0))
         |--Nested Loops(Inner Join)
              |--Constant Scan
              |--Compute Scalar(DEFINE:([Expr1002]=Convert([Expr1013])))
                   |--Stream Aggregate(DEFINE:([Expr1013]=Count(*)))
                        |--Filter(WHERE:([SalesData].[OrderDate]<=getdate()))
                             |--Clustered Index Scan(OBJECT:(
                                [TestingGround].[dbo].[SalesData].[pk_SalesData_SalesDataId]),
                                WHERE:([SalesData].[OrderDate]>='Jan  1 2002 12:00AM'))
    or (using the nonclustered covering index):
    |--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1002]>=1) then 1 else 0))
         |--Nested Loops(Inner Join)
              |--Constant Scan
              |--Compute Scalar(DEFINE:([Expr1002]=Convert([Expr1013])))
                   |--Stream Aggregate(DEFINE:([Expr1013]=Count(*)))
                        |--Index Seek(OBJECT:(
                           [TestingGround].[dbo].[SalesData].[x_SalesData_OrderDate_TotalDue]),
                           SEEK:([SalesData].[OrderDate] >= 'Jan  1 2002 12:00AM'
                           AND [SalesData].[OrderDate] <= getdate()) ORDERED FORWARD)
  • The surprising improvement (COUNT > 0, COUNT >= 1 and EXISTS in SQL 2005);

    (using the clustered primary key):
    |--Compute Scalar(DEFINE:([Expr1005]=CASE WHEN [Expr1006] THEN (1) ELSE (0) END))
         |--Nested Loops(Left Semi Join, DEFINE:([Expr1006] = [PROBE VALUE]))
              |--Constant Scan
              |--Filter(WHERE:([TestingGround].[dbo].[SalesData].[OrderDate]<=getdate()))
                   |--Clustered Index Scan(OBJECT:(
                      [TestingGround].[dbo].[SalesData].[pk_SalesData_SalesDataId]),
                      WHERE:([TestingGround].[dbo].[SalesData].[OrderDate]
                      >='2002-01-01 00:00:00.000'))
    or (using the nonclustered covering index):
    |--Compute Scalar(DEFINE:([Expr1005]=CASE WHEN [Expr1006] THEN (1) ELSE (0) END))
         |--Nested Loops(Left Semi Join, DEFINE:([Expr1006] = [PROBE VALUE]))
              |--Constant Scan
              |--Index Seek(OBJECT:(
                 [TestingGround].[dbo].[SalesData].[x_SalesData_OrderDate_TotalDue]),
                 SEEK:([TestingGround].[dbo].[SalesData].[OrderDate] >= '2002-01-01 00:00:00.000'
                 AND [TestingGround].[dbo].[SalesData].[OrderDate] <= getdate()) ORDERED

Notice the Stream Aggregate operator in the first case and its mysterious absence from the second?

Obviously, SQL Server 2005 "knows" what "if-count-greater-than-zero" and "if-count-at-least-one" mean – the execution plans for "COUNT > 0", "COUNT >= 1" and EXISTS are the same. Further more, setting the compatibility level to 80 or lower does not stop the new behaviour. Is this the result of an intended improvement or merely a side-effect of some other improvement?


The world keeps on changing

There certainly is strength in the EXISTS expression, and starting with SQL 2005 there appears to be strength in methods previously thought of as weak. At least this is true of the two presented here.


ML


P.S. If you're looking for the exist XML method, you can find it right here.

* Queries used in the demonstration:

  • Query #1:
    if ((
     select count(*)
      from dbo.SalesData
      where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
     ) > 0)
     begin
      print 1
     end
    else
     begin
      print 0
     end
  • Query #2:
    if (exists (
      select *
       from dbo.SalesData
       where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
      ))
     begin
      print 1
     end
    else
     begin
      print 0
     end
  • Covering index DDL script:
    create nonclustered index x_SalesData_OrderDate_TotalDue
     on dbo.SalesData
      (
      OrderDate
      ,TotalDue
      )
  • Query #3:
    -- Forcing the use of the clustered primary key:
    if (exists (
      select *
       from dbo.SalesData with(index(pk_SalesData_SalesDataId))
       where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
      ))
     begin
      print 1
     end
    else
     begin
      print 0
     end
    
    -- Forcing the use of the covering nonclustered index:
    if (exists (
      select *
       from dbo.SalesData with(index(x_SalesData_OrderDate_TotalDue))
       where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
      ))
     begin
      print 1
     end
    else
     begin
      print 0
     end
  • Query #4:
    -- Forcing the use of the clustered primary key:
    if (exists (
      select *
       from dbo.SalesData with(index(pk_SalesData_SalesDataId))
       where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
        and (dbo.SalesData.TotalDue = 3953.9884)
      ))
     begin
      print 1
     end
    else
     begin
      print 0
     end
    
    -- Forcing the use of the covering nonclustered index:
    if (exists (
      select *
       from dbo.SalesData with(index(x_SalesData_OrderDate_TotalDue))
       where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
        and (dbo.SalesData.TotalDue = 3953.9884)
      ))
     begin
      print 1
     end
    else
     begin
      print 0
     end
  • Query #5:
    if ((
     select count(*)
      from dbo.SalesData
      where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
     ) >= 1)
     begin
      print 1
     end
    else
     begin
      print 0
     end
  • Query #6:
    if ((
     select count(*)
      from dbo.SalesData
      where (dbo.SalesData.OrderDate between '2002-01-01' and getdate())
     ) > 1)
     begin
      print 1
     end
    else
     begin
      print 0
     end

No comments: