Whether 'tis nobler to count the rows in tables
And stall the traffic with outrageous pride
Or to merely stop as early as a single row
Fullfills the expectations; To exist: to be found.
Has Hamlet ever designed a data driven solution?
How to check whether one or more rows corresponding to a given set of conditions exist? It should be common knowledge by now that this
if (( select count(*) from ... where ... ) > 0) begin -- logic if rows exist end else begin -- logic if rows don't exist end
looks like the answer, but is not, or at least *might* not be. Nonetheless, methods like this one still manage to creep into production code now and then. Is that all bad? Keep reading – you may be surprised.
Anyway, in T-SQL there's a far more efficient method of checking whether rows exist, and we'll take a look at how efficient it actually is. And perhaps learn something new. Start by creating this table in your favourite SQL 2000 (yes, that's 2K) "testing-ground" database (the script assumes that [SQL2005] is the name of your SQL 2005 server where the AdventureWorks database resides). The only index included in the script is a clustered primary key on the identity column.
EXISTS
The expression is described in Books Online, along with some typical examples of use. The example above can thus be rewritten as:
if (exists ( select * from ... where ... )) begin -- logic if rows exist end else begin -- logic if rows don't exist end
What's the benefit? First of all, the query is now semantically correct: instead of counting the rows to test whether there is at least one that corresponds to the search criteria, we simply query the database engine for their existence. The more important aspect is the performance gain: queries using the EXISTS
expression outperform those using the "if-count-greater-than-zero" alternative. Well, they do in SQL 2000.
The example queries are available at the bottom of the post – follow the links...
Query | SQL 2000 Performance | |
---|---|---|
Clustered Primary Key | ||
CPU | Reads | |
#1 | 15 | 118 |
#2 | 0 | 7 |
118 reads vs. 7 (seven!) against a table with 31465 rows in total (30086 of those correspond to the criteria). The EXISTS
expression performs approximately 16 times faster than the one based on the row count.
How does SQL Server process an EXISTS
expression? To put it as simply as possible: data processing stops as soon as the condition is met, and only as much data is actually read as is needed to test whether the condition is satisfied by *any* row since no rows need to be returned to the client. This means that if an appropriate index exists, the database engine (theoretically) only needs to access a single index page.
Query | SQL 2000 Performance | |
---|---|---|
Covering Index | ||
CPU | Reads | |
#1 | 15 | 92 |
#2 | 0 | 2 |
With a covering index we're down to 2 reads for the EXISTS
expression. The performance of the "if COUNT > 0
" method is also slightly improved by the covering index as all processing can be done on index pages.
* or 1 or what?
There are three methods of forming the EXISTS
query that can be observed in practice:
exists (select * ...)
exists (select 1 ...)
exists (select %table_primary_key% ...)
Each of the above examples comes with a set of at least three users vigorously defending theirs as the most correct of all. The fact remains that the SELECT
statement of the EXISTS
expression does not affect the execution plan at all. IMHO, option #3 is the least sensible but still there's only one reason why it shouldn't be used: the query would cause errors if the table's primary key column(s) were renamed.
According to an urban myth option #1 is supposed to be the only correct one as "it allows the optimizer to choose the most appropriate index". This is perfecly unfounded as all three examples produce the same execution plan. At least in Microsoft SQL Server. =>
Busted!
With table hints the optimizer can be "guided" (=forced) into using a specific index. However, the use of index hints should be revised when (perhaps more appropriate) indexes are added or when the schema changes or even when the table usage changes.
Query | SQL 2000 Performance | |||||
---|---|---|---|---|---|---|
Clustered Index Scan | Nonclustered Index Seek (Covering Index) |
|||||
Reads | Estimated Row Size (Bytes) |
Estimated I/O Cost | Reads | Estimated Row Size (Bytes) |
Estimated I/O Cost | |
#3 | 7 | 45 | 0,0032066 | 2 | 45 | 0,0032041 |
#4 | 8 | 45 | 0,0032066 | 2 | 45 | 0,0032041 |
Query | SQL 2005 Performance | |||||
---|---|---|---|---|---|---|
Clustered Index Scan | Nonclustered Index Seek (Covering Index) |
|||||
Reads | Estimated Row Size (Bytes) |
Estimated I/O Cost | Reads | Estimated Row Size (Bytes) |
Estimated I/O Cost | |
#3 | 7 | 15 | 0,086088 | 2 | 9 | 0,0668287 |
#4 | 8 | 23 | 0,086088 | 2 | 15 | 0,0668287 |
It's safe to assume that an appropriate nonclustered index will be the optimizer's preferred choice. In fact, covering indexes once again take the gold.
But now, here's a bit of a surprise...
SQL Server 2005 knows about "if-count-greater-than-zero"
In SQL 2005 another performance improvement becomes apparent, sadly only when querying with the "COUNT > 0
" or the "COUNT >= 1
" condition.
Query | SQL 2005 Performance | |||
---|---|---|---|---|
Clustered Primary Key | Covering Index | |||
CPU | Reads | CPU | Reads | |
#1 | 0 | 7 | 0 | 2 |
#2 | 0 | 7 | 0 | 2 |
#5 | 0 | 7 | 0 | 2 |
#6 | 20 | 115 | 15 | 89 |
What?! Has counting rows been improved this much in SQL 2005? Nope. Query #6 tests whether there are more rows than 1 and the behaviour is back to expected. Let's compare execution plans:
- The expected behaviour (
COUNT
in SQL 2000);
(using the clustered primary key):|--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1002]>0) then 1 else 0)) |--Nested Loops(Inner Join) |--Constant Scan |--Compute Scalar(DEFINE:([Expr1002]=Convert([Expr1013]))) |--Stream Aggregate(DEFINE:([Expr1013]=Count(*))) |--Filter(WHERE:([SalesData].[OrderDate]<=getdate())) |--Clustered Index Scan(OBJECT:( [TestingGround].[dbo].[SalesData].[pk_SalesData_SalesDataId]), WHERE:([SalesData].[OrderDate]>='Jan 1 2002 12:00AM'))
or (using the nonclustered covering index):|--Compute Scalar(DEFINE:([Expr1004]=If ([Expr1002]>=1) then 1 else 0)) |--Nested Loops(Inner Join) |--Constant Scan |--Compute Scalar(DEFINE:([Expr1002]=Convert([Expr1013]))) |--Stream Aggregate(DEFINE:([Expr1013]=Count(*))) |--Index Seek(OBJECT:( [TestingGround].[dbo].[SalesData].[x_SalesData_OrderDate_TotalDue]), SEEK:([SalesData].[OrderDate] >= 'Jan 1 2002 12:00AM' AND [SalesData].[OrderDate] <= getdate()) ORDERED FORWARD)
- The surprising improvement (
COUNT > 0
,COUNT >= 1
andEXISTS
in SQL 2005);
(using the clustered primary key):|--Compute Scalar(DEFINE:([Expr1005]=CASE WHEN [Expr1006] THEN (1) ELSE (0) END)) |--Nested Loops(Left Semi Join, DEFINE:([Expr1006] = [PROBE VALUE])) |--Constant Scan |--Filter(WHERE:([TestingGround].[dbo].[SalesData].[OrderDate]<=getdate())) |--Clustered Index Scan(OBJECT:( [TestingGround].[dbo].[SalesData].[pk_SalesData_SalesDataId]), WHERE:([TestingGround].[dbo].[SalesData].[OrderDate] >='2002-01-01 00:00:00.000'))
or (using the nonclustered covering index):|--Compute Scalar(DEFINE:([Expr1005]=CASE WHEN [Expr1006] THEN (1) ELSE (0) END)) |--Nested Loops(Left Semi Join, DEFINE:([Expr1006] = [PROBE VALUE])) |--Constant Scan |--Index Seek(OBJECT:( [TestingGround].[dbo].[SalesData].[x_SalesData_OrderDate_TotalDue]), SEEK:([TestingGround].[dbo].[SalesData].[OrderDate] >= '2002-01-01 00:00:00.000' AND [TestingGround].[dbo].[SalesData].[OrderDate] <= getdate()) ORDERED
Notice the Stream Aggregate
operator in the first case and its mysterious absence from the second?
Obviously, SQL Server 2005 "knows" what "if-count-greater-than-zero" and "if-count-at-least-one" mean – the execution plans for "COUNT > 0
", "COUNT >= 1
" and EXISTS
are the same. Further more, setting the compatibility level to 80 or lower does not stop the new behaviour. Is this the result of an intended improvement or merely a side-effect of some other improvement?
The world keeps on changing
There certainly is strength in the EXISTS
expression, and starting with SQL 2005 there appears to be strength in methods previously thought of as weak. At least this is true of the two presented here.
ML
P.S. If you're looking for the exist
XML method, you can find it right here.
* Queries used in the demonstration:
- Query #1:
if (( select count(*) from dbo.SalesData where (dbo.SalesData.OrderDate between '2002-01-01' and getdate()) ) > 0) begin print 1 end else begin print 0 end
- Query #2:
if (exists ( select * from dbo.SalesData where (dbo.SalesData.OrderDate between '2002-01-01' and getdate()) )) begin print 1 end else begin print 0 end
- Covering index DDL script:
create nonclustered index x_SalesData_OrderDate_TotalDue on dbo.SalesData ( OrderDate ,TotalDue )
- Query #3:
-- Forcing the use of the clustered primary key: if (exists ( select * from dbo.SalesData with(index(pk_SalesData_SalesDataId)) where (dbo.SalesData.OrderDate between '2002-01-01' and getdate()) )) begin print 1 end else begin print 0 end -- Forcing the use of the covering nonclustered index: if (exists ( select * from dbo.SalesData with(index(x_SalesData_OrderDate_TotalDue)) where (dbo.SalesData.OrderDate between '2002-01-01' and getdate()) )) begin print 1 end else begin print 0 end
- Query #4:
-- Forcing the use of the clustered primary key: if (exists ( select * from dbo.SalesData with(index(pk_SalesData_SalesDataId)) where (dbo.SalesData.OrderDate between '2002-01-01' and getdate()) and (dbo.SalesData.TotalDue = 3953.9884) )) begin print 1 end else begin print 0 end -- Forcing the use of the covering nonclustered index: if (exists ( select * from dbo.SalesData with(index(x_SalesData_OrderDate_TotalDue)) where (dbo.SalesData.OrderDate between '2002-01-01' and getdate()) and (dbo.SalesData.TotalDue = 3953.9884) )) begin print 1 end else begin print 0 end
- Query #5:
if (( select count(*) from dbo.SalesData where (dbo.SalesData.OrderDate between '2002-01-01' and getdate()) ) >= 1) begin print 1 end else begin print 0 end
- Query #6:
if (( select count(*) from dbo.SalesData where (dbo.SalesData.OrderDate between '2002-01-01' and getdate()) ) > 1) begin print 1 end else begin print 0 end
No comments:
Post a Comment