Thursday, September 21, 2006

Date[Time] constructor SQL-style

Querying temporal data in SQL Server has been discussed on many occasions: tips and how-tos can be found online, and the best practices regarding date/time data that have evolved over time have been covered in blogs, articles and books. As the eternal quest for practical, simple and efficient solutions to temporal problems in SQL continues, I've decided to chip in and show you a pretty simple example of how datetime values should be queried in SQL Server in order to maximize performance – and, at the same time, how to bridge the gap between the business and the data store concepts of time.

In most information management solutions keeping and accessing temporal data is vital to the business – e.g. for auditing, managing time-critical information, tracking the evolution of data, etc. Understanding the concept of time implemented in SQL Server is therefore highly important.

Most tips regarding the querying of date/time data revolve around one highly important aspect – performance. While SQL Server performs best with small data types (e.g. exact numerics, fixed-length character data types (up to a point), date/time values, etc.), it's usually the users who fail to take advantage of this fact. Too often we see queries like this one, and they bring performance to a halt:

-- #1

use AdventureWorks
go

select *
 from Sales.SalesOrderHeader
 where (year(OrderDate) = 2001)
  and (month(OrderDate) = 7)
go

Fix it before it breaks you

First of all, the dreaded "all-columns wildcard" has no place in production systems (with the possible exception of EXISTS expressions, but that's another story). So, the obvious solution is to limit the columns in the result set to the ones we actually need – for this example, let's focus on the SalesOrderID, Status, CustomerID, SalesPersonID and TotalDue columns.

-- #2

select SalesOrderID
 ,Status
 ,CustomerID
 ,SalesPersonID
 ,TotalDue
 from Sales.SalesOrderHeader
 where (year(OrderDate) = 2001)
  and (month(OrderDate) = 7)
go

Now, let's optimize the execution of our query by indexing the table. I prefer covering indexes, don't you? ;)

create nonclustered index x_SalesOrderHeader_OrderDate
 on Sales.SalesOrderHeader
  (
  OrderDate
  ,SalesOrderID
  ,CustomerID
  ,SalesPersonID
  )
 include (
  Status
  ,TotalDue
  )
go

What's the result of this improvement? Let's execute query #2 again: 703 logical reads without the covering index vs. 139 with it. Nice, isn't it? It is, but we can do better. A lot better. We just need to replace what *seems* right with something that *is* right – functions on columns in the restrictive conditions of a query may correspond to the business requirements perfectly, but they usually fail to correspond to the principles of efficient data management implemented in the data store (as we'll see later on).
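(If you'd like to observe these figures yourself, SET STATISTICS IO and SET STATISTICS TIME print them for every statement – this is one way of measuring, though not necessarily how the numbers in this post were gathered.)

-- one way to observe CPU time and logical reads per statement
set statistics io on
set statistics time on
go

select SalesOrderID
 ,Status
 ,CustomerID
 ,SalesPersonID
 ,TotalDue
 from Sales.SalesOrderHeader
 where (year(OrderDate) = 2001)
  and (month(OrderDate) = 7)
go

set statistics io off
set statistics time off
go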


What is "July 2001" in T-SQL?

Quite simply, it's the time between July 1st 2001 and July 31st 2001; both boundaries included:

-- #3

select SalesOrderID
 ,Status
 ,CustomerID
 ,SalesPersonID
 ,TotalDue
 from Sales.SalesOrderHeader
 where (OrderDate >= cast('2001-07-01' as datetime))
  and (OrderDate < dateadd(m, 1, cast('2001-07-01' as datetime)))
go

Because SQL Server always stores date and time as a single value (look up datetime and smalldatetime in Books Online), the values above are implicitly converted at run time to 2001-07-01 00:00:00.000 and 2001-08-01 00:00:00.000, respectively. There are ongoing discussions everywhere on how SQL Server should support separate data types for date and time values, but personally, I see little use in that. Anyway, a month in T-SQL is a time range beginning with the first millisecond of the first day of the month and ending just before the first millisecond of the first day of the next month.
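A trivial check makes the implicit conversion visible:

-- both literals resolve to full datetime values (midnight)
select cast('2001-07-01' as datetime)           -- 2001-07-01 00:00:00.000
 ,dateadd(m, 1, cast('2001-07-01' as datetime)) -- 2001-08-01 00:00:00.000
go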

Back to our query: suddenly we're down to two logical reads. That's 2 logical reads to get 184 rows from a table containing 31465 rows.

The downside is that – purely for performance reasons – we'd have to propagate SQL-specific thinking up from the data layer: a month would no longer be just a simple value (or two) but a range of values (a set, if you like) – a concept that may just as well be completely irrelevant to the application (and/or the user). Could that be a problem?

The business requirement is simple: "show the data for a specific month", and as far as the solution as a whole is concerned the application logic behind what constitutes a month should not depend on the data layer's concept of a month – and vice versa. By translating the query conditions from the general definition corresponding to the business rule to a more specific definition corresponding to the data store rule we can prevent a conflict of the two concepts. IMHO this is one of the things stored procedures handle perfectly. Some elements of business logic should be implemented in stored procedures, but more importantly – data store-specific functionalities should be kept solely in stored procedures (outside of the application tier, by all means).
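For illustration, here's a minimal sketch of what I mean – the procedure name and shape are hypothetical (not part of AdventureWorks): the application asks for a month in business terms, and the translation into a date/time range never leaves the data layer.

-- hypothetical wrapper: the caller passes business terms (year, month);
-- the range logic stays inside the procedure
create procedure dbo.SalesOrderHeader_SelectByMonth
 (
 @year smallint
 ,@month tinyint
 )
as
begin
 select SalesOrderID
  ,Status
  ,CustomerID
  ,SalesPersonID
  ,TotalDue
  from Sales.SalesOrderHeader
  -- dateadd arithmetic on the base date (day 0 = 1900-01-01) keeps the
  -- predicate free of functions on the column itself
  where (OrderDate >= dateadd(m, @month - 1, dateadd(yy, @year - 1900, 0)))
   and (OrderDate < dateadd(m, @month, dateadd(yy, @year - 1900, 0)))
end
go

-- exec dbo.SalesOrderHeader_SelectByMonth @year = 2001, @month = 7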

Which brings us to the point of this post – a SQL-style date/time constructor could prove useful here.

As is quite obvious from the examples above, the translation of the single business term "July 2001" to a pair of data store-specific terms is what we need – a transition from the logical definition of a temporal quantity to the physical definition of a date/time range.

We'll use two scalar functions for this purpose; each accepts up to three parameters (values representing the year, the month and the day), and they return, respectively, the lower and the upper boundary of the date/time range.

create function dbo.Date_LowerBoundary
 (
 @year smallint = null
 ,@month tinyint  = null
 ,@day tinyint  = null
 )
returns datetime
/*
Returns the lower boundary of the date/time range described by the
parameters. The default value for @year is 1753 (datetime minimum);
missing lower-granularity parts fall back to the first month/day.
*/
as
begin
 declare @minResult datetime
 
 -- Build an unseparated date string (yyyymmdd) - a format that converts
 -- to datetime regardless of the DATEFORMAT/language settings; the
 -- concatenated string is implicitly converted on assignment.
 select @minResult
   = case
    when @year between 1753 and 9999
     then cast(@year as char(4))
    else '1753'
    end
   + case
    when @month between 1 and 12
     then right('00' + cast(@month as varchar(2)), 2)
    else '01'
    end
   + case
    when (@month is not null)
     and (@day between 1 and 31)
     then right('00' + cast(@day as varchar(2)), 2)
    else '01'
    end
 
 return @minResult
end
go
create function dbo.Date_UpperBoundary
 (
 @year smallint = null
 ,@month tinyint  = null
 ,@day tinyint  = null
 )
returns datetime
/*
Returns the upper boundary of the date/time range described by the
parameters. The default value for @year is 9999 (datetime maximum).
*/
as
begin
 declare @minResult datetime
 declare @maxResult datetime
 
 set @year = coalesce(@year, 9999)
 
 -- First compute the lower boundary of the range (same logic as in
 -- dbo.Date_LowerBoundary)...
 select @minResult
   = case
    when @year between 1753 and 9999
     then cast(@year as char(4))
    else '1753'
    end
   + case
    when @month between 1 and 12
     then right('00' + cast(@month as varchar(2)), 2)
    else '01'
    end
   + case
    when (@month is not null)
     and (@day between 1 and 31)
     then right('00' + cast(@day as varchar(2)), 2)
    else '01'
    end
 
 -- ...then step to the first value of the next range and subtract 3 ms,
 -- landing on the last value the datetime type can represent within the
 -- current range (hh:mm:ss.997). For the year 9999 the subtraction is
 -- done first, so the intermediate value never exceeds the datetime
 -- maximum.
 select @maxResult
   = case
    when @year = 9999
     then dateadd(yy, 1, dateadd(ms, -3, @minResult))
    when @month is null
     then dateadd(ms, -3, dateadd(yy, 1, @minResult))
    when @day is null
     then dateadd(ms, -3, dateadd(m, 1, @minResult))
    else dateadd(ms, -3, dateadd(dd, 1, @minResult))
    end
 
 return @maxResult
end
go

The functions provide a layer of abstraction between the business and the data store-specific concepts of time. The translation is made in the data access layer; the application remains oblivious to the concept shift. The abstraction also allows for less determinism when supplying parameters: if all parameters are unknown (null), the lowest allowed datetime value is used for the lower boundary and the highest allowed datetime value for the upper boundary; alternatively, @year alone can be specified, or @year and @month together, or all three parameters. The coarsest granularity is thus the entire SQL Server date/time range, and the finest a single day.

Warning: the parameters are not validated; the functions assume that the input parameters originate from valid date/time values. A standard system error is raised if the transformed values don't evaluate to datetime.
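A quick sanity check (the commented values are what the functions should return, given the definitions above; note that a parameter of a scalar function can't simply be omitted – the keyword default has to be passed in its place):

select dbo.Date_LowerBoundary(2001, 7, default)     -- 2001-07-01 00:00:00.000
 ,dbo.Date_UpperBoundary(2001, 7, default)          -- 2001-07-31 23:59:59.997
 ,dbo.Date_UpperBoundary(2001, default, default)    -- 2001-12-31 23:59:59.997
 ,dbo.Date_UpperBoundary(default, default, default) -- 9999-12-31 23:59:59.997
go

-- no validation, as warned above: an impossible date slips through as a
-- string and only fails at the implicit conversion to datetime
-- select dbo.Date_LowerBoundary(2001, 2, 30)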


Actions speak louder than words

Here are two examples of how the two date/time functions could be used.

The functions could be used in-line in the WHERE clause:

-- #4

declare @year smallint
declare @month tinyint
declare @day tinyint

set @year = 2001
set @month = 7

select SalesOrderID
 ,Status
 ,CustomerID
 ,SalesPersonID
 ,TotalDue
 from Sales.SalesOrderHeader
 where (OrderDate
  between dbo.Date_LowerBoundary(@year, @month, @day)
  and dbo.Date_UpperBoundary(@year, @month, @day))
go

The results of the functions could be assigned to variables prior to being used in the WHERE clause:

-- #5

declare @year  smallint
declare @month  tinyint
declare @day  tinyint
declare @lowerBoundary datetime
declare @upperBoundary datetime

set @year = 2001
set @month = 7
set @lowerBoundary = dbo.Date_LowerBoundary(@year, @month, @day)
set @upperBoundary = dbo.Date_UpperBoundary(@year, @month, @day)

select SalesOrderID
 ,Status
 ,CustomerID
 ,SalesPersonID
 ,TotalDue
 from Sales.SalesOrderHeader
 where (OrderDate between @lowerBoundary and @upperBoundary)
go

The upper boundary function returns the last valid datetime value of a given range, not the first valid datetime value of the next range, which means that it can be used with the BETWEEN operator. (And won't break when the year 9999 finally comes ;)

In case you're wondering, BETWEEN is inclusive, which makes

@myDate between @lowerDate and @upperDate

and

@myDate >= @lowerDate and @myDate <= @upperDate

functionally equivalent.
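A trivial check – and a reminder of why the upper boundary ends at .997 rather than at the next midnight (passing null here works just as well as the keyword default):

declare @lowerDate datetime
declare @upperDate datetime

set @lowerDate = dbo.Date_LowerBoundary(2001, 7, null)
set @upperDate = dbo.Date_UpperBoundary(2001, 7, null)

-- the last millisecond of July is in range...
select case
  when cast('2001-07-31T23:59:59.997' as datetime)
    between @lowerDate and @upperDate
   then 'in range' else 'out of range'
  end
-- ...while midnight of August 1st is not (it would be, if the upper
-- boundary were 2001-08-01 00:00:00.000)
select case
  when cast('2001-08-01T00:00:00.000' as datetime)
    between @lowerDate and @upperDate
   then 'in range' else 'out of range'
  end
go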


Abstraction costs

Yes, if conceptual simplicity is the goal, one should count on a slight performance impact, as the measurements below clearly show. However, another fact is (or should be) far more obvious: using functions on columns in the WHERE clause has a devastating impact on query performance, because the optimizer can't use (otherwise appropriate) indexes if column values need to be transformed before the conditions can be evaluated.

Query performance (CPU = CPU time, Reads = logical reads; six runs per query):

Query   No additional indexes *    Covering index **    Datetime column index ***
        CPU     Reads              CPU     Reads        CPU     Reads
#1      34      703                14      751          24      715
        40      703                20      751          20      882
        40      703                10      751          20      673
        30      703                20      751          31      673
        30      703                10      751          20      673
        30      703                10      751          30      673
#2      30      703                18      139          22      893
        30      703                10      139          20      898
        20      703                20      139          20      898
        30      703                10      139          20      898
        40      703                20      139          20      884
        30      703                30      139          30      886
#3      24      703                0       2            8       644
        20      703                0       2            10      766
        30      703                0       2            10      614
        30      703                0       2            0       614
        20      703                0       2            10      614
        20      703                0       2            10      614
#4      1730    707                4       7            1675    707
        1773    707                0       7            1883    707
        1632    707                0       7            1673    707
        1722    707                10      7            1613    707
        1712    707                0       7            1602    707
        1813    707                10      7            1602    707
#5      28      707                2       7            44      707
        31      707                0       7
        40      707                0       7
        10      707                10      7
        30      707                0       7
        30      707                0       7

The best-performing option remains query #3, where the date/time range is prepared in advance. If the 5 additional logical reads incurred by query #5 seem too much in a given situation, then option #3 is the way to go. On the other hand, the functions discussed in this post do provide an abstraction that is at least worth considering.

We've reached the highest performance using a composite covering index with the datetime column in the leading position (statistics are kept for the first column of a composite index). Of course, not all situations can be catered for with covering indexes – the possible impact on insert/update performance should always be considered. The two functions can still provide good results even without a covering index, as long as they're not used directly in the WHERE clause; option #5 should be preferred over option #4, which is clearly more dependent on appropriate indexing.


ML


* By default only key columns are indexed in the Sales.SalesOrderHeader table: rowguid, SalesOrderNumber, CustomerID, SalesPersonID, SalesOrderID.

** The covering index spans these columns: OrderDate, SalesOrderID, CustomerID and SalesPersonID; it includes the Status and TotalDue columns.

*** To demonstrate the actual benefit of the covering index I've included the results of a test using a nonclustered index on the OrderDate column:

create nonclustered index x_SalesOrderHeader_OrderDate
 on Sales.SalesOrderHeader
  (
  OrderDate
  )

3 comments:

Anonymous said...

very nice article.
also check this page out for all
kinds of datetime goodies :)
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=64762

having a datetime table like that is also awesome because then time periods are as simple as an inner join :)

Anonymous said...

thank you for showing the use of a covered index.
I find it surprising that an index with the OrderDate only does not get used. Is this the case on large datasets as well?
TIA
Henrik Staun Poulsen

Matija Lah said...

While it is true that covering indexes provide the highest performance, you should also be aware of the fact that non-covering indexes nonetheless do allow the query optimizer to build a more efficient execution plan.

A covering index leads to better performance simply because the database engine does not have to read data from the data pages of the data file - all the data is available on the index pages that have to be read anyway when testing the query condition.

On the other hand, a non-covering index that could be used to test the query conditions will be used if the data is selective enough, but the clustered index (e.g. primary key) would still have to be read as well as the data pages in order to retrieve all relevant values. However, the range of the clustered index scan (or seek) operation would be smaller than in the case where an appropriate (covering or non-covering) index would not be available.

Yes, both the size of the data and the selectivity of the indexed columns are highly critical when the execution plan is built. You can find a lot more information in Books Online.