Counting – what's it all about? I'm sure you've heard of the trick one is supposed to use when aggregating data with the COUNT system function in Microsoft SQL Server – a neat little trick that supposedly improves the function's performance. But can we go beyond neat...?
The tests have been done on a SQL Server 2005 (SP2) instance; you're all kindly invited to also run them on an instance/version of your own choice.
Send in the rows
Let's start with some test data. I've provided a script that will do that for you, provided you have a testing database – which I'm sure you do (yes, tempdb is just fine).
The script creates a five column table and copies data into it from the AdventureWorks database, which I assume is available on the same server instance. If any of the above is not true in your particular case, you can still adapt the script as needed.
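The script itself isn't reproduced here, but a minimal sketch of what such a script might do – note that the column choice below is my assumption, not necessarily what the actual script selects – could look like this:

```sql
-- Hypothetical sketch of the test-data script (NOT necessarily the actual one):
-- copy five columns from AdventureWorks into a brand new heap table
-- (no keys, no indexes yet). Adapt names/columns to your own environment.
select ContactID, Title, FirstName, LastName, EmailAddress
into dbo.Contact
from AdventureWorks.Person.Contact;
```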
We end up with a brand new table containing 19,972 rows of data residing in 5,197 data pages.
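The article doesn't show how the CPU and Reads figures in the tables below were captured; presumably something along these lines (the standard session options for exactly these counters):

```sql
-- Report CPU time (in ms) and logical reads for each statement that follows;
-- SQL Profiler or a server-side trace would yield the same counters.
set statistics time on;
set statistics io on;
-- ...run the queries under test here...
set statistics time off;
set statistics io off;
```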
The plan
We'll be counting rows (what a shock):
- Query #1 – using no restriction (i.e. counting all rows):

```sql
select count(*) from dbo.Contact
```

- Query #2 – using a restriction (i.e. counting rows corresponding to specific criteria):

```sql
select count(*) from dbo.Contact where (dbo.Contact.FirstName = N'Karl')
```
Get on with it!
If we execute the two queries above on our heap (remember, this new table has no keys and no indexes), we get these results:
Performance (on a heap):

| Query | CPU (ms) | Reads | Result (row count) |
|---|---|---|---|
| #1 | 76 | 5203 | 19972 |
| #2 | 62 | 5203 | 22 |
Also take a look at the execution plans (Query #1 execution plan, Query #2 execution plan). Nothing too impressive, but understandable: without any indexes, the optimizer needs to access all the pages in order to get the actual row count. With Query #1 the optimizer has no option but to do a full table scan, and the restriction in Query #2 can't really improve the performance, as it requires the data to be tested in order to restrict the result set.
Now, let's add the primary key. We'll use the ContactID column and also make the primary key clustered (which is the default anyway, but I prefer explicit declarations):

```sql
alter table dbo.Contact add constraint pk_Contact primary key clustered ( ContactID )
```
Performance (clustered primary key):

| Query | CPU (ms) | Reads | Result (row count) |
|---|---|---|---|
| #1 | 34 | 5208 | 19972 |
| #2 | 68 | 5208 | 22 |
Ha! The number of reads has actually increased. Note that the clustered primary key actually takes up as many pages as the data itself. On the other hand, for Query #1 the stress on the CPU has decreased, while for Query #2 it seems that the clustered primary key can't really do much – data must be read and tested, and the referenced column is not part of the primary key.
Do you think making it part of the clustered primary key might make a difference? Would it be a favorable difference? I've decided not to include that test here, as it affects the data model, and modifying the data model (especially changing a primary key) just to improve performance is IMO far from good practice. If you're curious enough, you can try it yourself.¹
Anyway, as expected (and quite evident as well), the execution plans are now different (Query #1 execution plan, Query #2 execution plan) compared to before.
Let's see what happens if we add an index. Let's index the column used in the restriction of Query #2 – the FirstName column:

```sql
create nonclustered index x_Contact_FirstName on dbo.Contact ( FirstName )
```
Performance (clustered PK + nonclustered index on FirstName):

| Query | CPU (ms) | Reads | Result (row count) |
|---|---|---|---|
| #1 | 4 | 59 | 19972 |
| #2 | 0 | 2 | 22 |
Well, well, who'd have thought...? The optimizer once again lives up to its name. Let's examine the execution plans (Query #1 execution plan, Query #2 execution plan). The clustered primary key is no longer the optimal choice – the nonclustered index requires far less reading. After all, it resides on only 57 pages, and the optimizer takes full advantage of this fact – on both occasions.
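If you'd like to verify page counts like these on your own instance, one way (assuming SQL Server 2005 or later, connected to the test database) is the sys.dm_db_index_physical_stats DMF:

```sql
-- Page count per index on dbo.Contact; the nonclustered index on FirstName
-- should be far smaller than the clustered index (which is the data itself).
select i.name as index_name, ips.index_type_desc, ips.page_count
from sys.dm_db_index_physical_stats(
         db_id(), object_id(N'dbo.Contact'), null, null, 'LIMITED') as ips
join sys.indexes as i
    on i.object_id = ips.object_id
   and i.index_id  = ips.index_id;
```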
Some Romans fear Asterix...
...others fear the asterisk like the plague. However, if we consider the fact that

```sql
select count(*)
```

is simply not in any way functionally equivalent to

```sql
select *
```

then this fear loses most of its substance. I will not go into details this time about when and why omitting an explicit column list from the SELECT clause is poor practice.
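On a related note – a semantic point rather than one of the performance tests above – the argument of COUNT does matter when it is a column name instead of a constant or the asterisk, because NULLs are then excluded. Assuming the test table includes a nullable column such as Title (an assumption about the table's columns):

```sql
-- count(*) counts rows; count(SomeColumn) counts non-NULL values only.
-- The two results differ whenever the column contains NULLs.
select count(*)     as all_rows,
       count(Title) as non_null_titles
from dbo.Contact;
```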
For comparison let's execute these two queries:
- Query #4 – retrieving all rows (and all columns):

```sql
select * from dbo.Contact
```

- Query #5 – retrieving a subset of rows:

```sql
select * from dbo.Contact where (dbo.Contact.FirstName = N'Karl')
```
The execution plans:
- Heap: Query #4 execution plan, Query #5 execution plan;
- Clustered primary key: Query #4 execution plan, Query #5 execution plan;
- Clustered primary key and nonclustered index on FirstName: Query #4 execution plan, Query #5 execution plan.
If we now compare the execution plans of these last two queries with the execution plans above, we can observe one significant difference. When viewing the execution plan in SSMS park the mouse over the arrows leading from the very first operator (Table Scan or Clustered/Nonclustered Index Scan/Seek). What can we observe?
Simply counting rows does not require any data to be read from the data pages (except when values need to be tested in order to apply query restrictions).
What about that "neat trick"?
A method of speeding up row counts that has been proposed by some members of the community from time to time is to replace the asterisk in the COUNT function's argument with a constant. This – believe it or not – is supposed to lower the I/O of the query. Frankly, I've found that suggestion somewhat amusing, if not illogical, but I was curious enough to put it to the test.
The two queries above using the "trick" look like this:
- Query #6 – using no restriction (i.e. counting all rows):

```sql
select count(1) from dbo.Contact
```

- Query #7 – using a restriction (i.e. counting rows corresponding to specific criteria):

```sql
select count(1) from dbo.Contact where (dbo.Contact.FirstName = N'Karl')
```
Performance (on a heap):

| Query | CPU (ms) | Reads | Result (row count) |
|---|---|---|---|
| #6 | 68 | 5203 | 19972 |
| #7 | 76 | 5203 | 22 |
An equal number of reads, obviously, and a slight difference in CPU times compared to the previous results for the heap.
Performance (clustered primary key):

| Query | CPU (ms) | Reads | Result (row count) |
|---|---|---|---|
| #6 | 64 | 5208 | 19972 |
| #7 | 84 | 5208 | 22 |
Again, an equal number of reads, but this time it seems more processing is required when using the clustered primary key. Is the optimizer translating the constant?
Performance (clustered PK + nonclustered index on FirstName):

| Query | CPU (ms) | Reads | Result (row count) |
|---|---|---|---|
| #6 | 2 | 59 | 19972 |
| #7 | 0 | 2 | 22 |
Once more, an equal number of reads, but now it seems less processing is required when the nonclustered index is used.
The results are pretty inconclusive if you ask me, and the execution plans are the same as before:
- Heap: Query #6 execution plan, Query #7 execution plan;
- Clustered primary key: Query #6 execution plan, Query #7 execution plan;
- Clustered primary key and nonclustered index on FirstName: Query #6 execution plan, Query #7 execution plan.
One thing is certain, though: `select count(1)` does not perform any better than `select count(*)`. And I've tried other constants as well – no change. Is that another myth busted? Can anyone confirm this or prove otherwise?
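One way to convince yourself that the two variants compile to the same plan on your own instance is to compare their textual showplans:

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch, hence the GOs.
-- With it on, the statements are not executed; only their plans are returned.
set showplan_text on;
go
select count(*) from dbo.Contact;
select count(1) from dbo.Contact;
go
set showplan_text off;
go
```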
As for me, I'm staying with Asterix. ;) And next time we'll look at another important fact regarding aggregate functions.
ML
¹ FYI: Dropping the primary key in this particular case results in an increased number of reads (5218).