Friday, January 11, 2008

What's wrong with COUNT(*)?

Counting – what's it all about? I'm sure you've heard about a trick that one should use when aggregating data using the COUNT system function in Microsoft SQL Server; specifically, how the performance of the function can be improved using a neat trick. But can we go beyond neat...?

The tests have been done on a SQL Server 2005 (SP2) instance; you're all kindly invited to also run them on an instance/version of your own choice.


Send in the rows

Let's start with some test data. I've provided a script that will do that for you, provided you have a testing database – which I'm sure you do (yes, tempdb is just fine).

The script creates a five column table and copies data into it from the AdventureWorks database, which I assume is available on the same server instance. If any of the above is not true in your particular case, you can still adapt the script as needed.

We end up with a brand new table containing 19,972 rows of data residing in 5,197 data pages.


The plan

We'll be counting rows (what a shock):

  • Query #1 – using no restrictions (i.e. counting all rows);
    select count(*)
     from dbo.Contact
    and
  • Query #2 – using a restriction (i.e. counting rows corresponding to specific criteria).
    select count(*)
     from dbo.Contact
     where (dbo.Contact.FirstName = N'Karl')

Get on with it!

If we execute the two queries above on our heap (remember, this new table has no keys and no indexes), we get these results:

Query Performance
(on a heap)
Result
(row count)
CPU Reads
#1 76 5203 19972
#2 62 5203 22

Also take a look at the execution plans (Query #1 execution plan, Query #2 execution plan). Nothing too impressive, but understandable: without any indexes, the optimizer needs to access all the pages in order to get the actual row count. With Query #1 the optimizer has no options but to do a full table scan, and the restriction in Query #2 can't really improve the performance as it requires data to be tested in order to restrict the result set.

Now, let's add the primary key. We'll use the ContactID column and also make the primary key clustered (which is the default anyway, but I prefer explicit declarations):

alter table dbo.Contact
 add constraint pk_Contact
   primary key clustered
   (
   ContactID
   )
Query Performance
(clustered primary key)
Result
(row count)
CPU Reads
#1 34 5208 19972
#2 68 5208 22

Ha! The number of reads has actually increased. Note that the clustered primary key actually takes up as many pages as the data itself. On the other hand, for Query #1 the stress on the CPU has decreased, while for Query #2 it seems that the clustered primary key can't really do much – data must be read and tested, and the referenced column is not part of the primary key.

Do you think making it part of the clustered primary key might make a difference? Do you think it would make a favorable difference? I've decided on not including the latter in this test as it affects the data model, and modifying the data model (especially, changing a primary key) just to improve performance is IMO far from being good practice. If you're curious enough you can do it yourself.1

Anyway, as expected (and quite imminent as well), the execution plans are different now (Query #1 execution plan, Query #2 execution plan) compared to before.

Let's see what happens if we add an index. Let's index the column used in the restriction of Query #2 – the FirstName column:

create nonclustered index x_Contact_FirstName
 on dbo.Contact
  (
  FirstName
  )
Query Performance
(cl. PK + noncl. FirstName)
Result
(row count)
CPU Reads
#1 4 59 19972
#2 0 2 22

Well, well, who'd have thought...? The optimizer once again stands up to its name. Let's examine the execution plans (Query #1 execution plan, Query #2 execution plan). No longer is the use of the clustered primary key the optimal choice – the nonclustered index requires far less reading. After all, it resides on only 57 pages, and the optimizer takes full advantage of this fact – on both occasions.


Some Romans fear Asterix...

..others fear the asterisk like the plague. However, if we consider the fact that

select count(*)

simply is not in any way functionally equivalent to

select *

then this fear loses most of its substance. I will not go into any details this time of when and why not declaring columns in the SELECT clause is poor practice.

For comparison let's execute these two queries:

  • Query #4 – retrieving all rows (and all columns);
    select *
     from dbo.Contact
    and
  • Query #5 – retrieving a subset of rows.
    select *
     from dbo.Contact
     where (dbo.Contact.FirstName = N'Karl')

The execution plans:

If we now compare the execution plans of these last two queries with the execution plans above, we can observe one significant difference. When viewing the execution plan in SSMS park the mouse over the arrows leading from the very first operator (Table Scan or Clustered/Nonclustered Index Scan/Seek). What can we observe?

Simply counting rows does not require any data to be read from the data pages (except when values need to be tested in order to apply query restrictions).


What about that "neat trick"?

A proposed method of speeding up row counts that has been suggested by some members of the community from time to time, is to replace the asterisk as the argument of the COUNT function with a constant. This – believe it or not – is supposed to lower the I/O for the query. Frankly, I've found that suggestion somewhat amusing if not illogical, but was curious enough to put it to a test.

The two queries above using the "trick" look like this:

  • Query #6 – using no restrictions (i.e. counting all rows);
    select count(1)
     from dbo.Contact
    and
  • Query #7 – using a restriction (i.e. counting rows corresponding to specific criteria).
    select count(1)
     from dbo.Contact
     where (dbo.Contact.FirstName = N'Karl')
Query Performance
(on a heap)
Result
(row count)
CPU Reads
#6 68 5203 19972
#7 76 5203 22

An equal number of reads, obviously, and a slight difference between CPU times compared to previous results for the heap.

Query Performance
(clustered primary key)
Result
(row count)
CPU Reads
#6 64 5208 19972
#7 84 5208 22

Again, an equal number of reads, but this time it seems more processing is required when using the clustered primary key. Is the optimizer translating the constant?

Query Performance
(cl. PK + noncl. FirstName)
Result
(row count)
CPU Reads
#6 2 59 19972
#7 0 2 22

Once more, an equal number of reads, but now it seems less processing is required when the nonclustered index is used.

The results are pretty inconclusive if you ask me, and the execution plans are the same as before:

One thing is certain, though, select count(1) does not perform any better than select count(*). And I've tried using other constants as well – no change. Is that another myth busted? Can anyone confirm this or prove otherwise?

As for me, I'm staying with Asterix. ;) And next time we'll look at another important fact regarding aggregate functions.


ML


*1 FYI: Destroying the primary key in this particular case results in an increased number of reads (5218).

2 comments:

Anonymous said...

Hi Matija,

I believe you accidentally posted a wrong query text for #6 and #7, as they use the * instead of the 1.

The differences you see in CPU time are so small that they are due to normal fluctuations. Run the same query a number of times in a row, and you'll see the same. COUNT(*) and COUNT(1) (or any other constant) will always produce the same query plan, as will COUNT(column that does not allow NULLs). IIRC, this has been confirmed by Microsoft. However, COUNT(column that does allow NULLs) may produce a different (and never faster) plan, since the column has to be checked in order to exclude NULL values from the count.

You also mentioned a slight increase of pages read when converting from heap to clustered index. That is to be expected from a "fresh" heap, since the clustered index needs a few extra pages (root and intermediate level) that the heap does without. However, after some update activity the number of pages used by the heap will have grown significantly beyond that of the clustered index, because heaps hardly (if ever) reuse space left by rows that were removed or moved to a different page after an update of a varing length column.

Best, Hugo

Matija Lah said...

Hugo,

Thank you for the additional info. You actually went to the very edge of spoiling my next post. ;)

The execution plan for Query #7 is the actual execution plan. It seems that with parameterisation the constant is automatically replaced by the asterisk.

Which is actually what I was trying to show here - namely, that there is no difference between the performance of count(*) vs. count(1), even though some people out there believe that there is.


ML