Querying temporal data in SQL Server has been discussed on many occasions: plenty of tips and how-to's can be found online, and best practices regarding date/time data have evolved over time in blogs, articles and books. As the eternal quest for practical, simple and efficient solutions to temporal data in SQL continues, I've decided to chip in and show you a fairly simple example of how datetime values should be queried in SQL Server in order to maximize performance – and, at the same time, how to bridge the gap between the business and the data store concepts of time.
In most information management solutions keeping and accessing temporal data is vital to the business needs – e.g. auditing, managing time-critical information, tracking the evolution of data, etc. Understanding the concept of time implemented in SQL Server is therefore highly important.
Most tips regarding the querying of date/time data revolve around one highly important aspect – performance. While SQL Server performs best with small data types (exact numerics, fixed-length character data up to a point, date/time values, etc.), it's usually the users who fail to take advantage of this fact. Too often we see queries like the one below, and they bring performance to a halt:
-- #1
use AdventureWorks
go

select *
 from Sales.SalesOrderHeader
 where (year(OrderDate) = 2001)
  and (month(OrderDate) = 7)
go
Fix it before it breaks you
First of all, the dreaded "all-columns-wildcard" has no place in production systems (with the possible exception of EXISTS expressions, but that's another story). So, the obvious solution is to limit the columns in the result set to the ones we actually need in a particular case – let's focus on the SalesOrderID, Status, CustomerID, SalesPersonID and TotalDue columns for this example.
-- #2
select SalesOrderID
 ,Status
 ,CustomerID
 ,SalesPersonID
 ,TotalDue
 from Sales.SalesOrderHeader
 where (year(OrderDate) = 2001)
  and (month(OrderDate) = 7)
go
Now, let's optimize the execution of our query by indexing the table. I prefer covering indexes, don't you? ;)
create nonclustered index x_SalesOrderHeader_OrderDate
 on Sales.SalesOrderHeader
  (
  OrderDate
  ,SalesOrderID
  ,CustomerID
  ,SalesPersonID
  )
 include
  (
  Status
  ,TotalDue
  )
go
What's the result of this improvement? Let's execute query #2 again. 703 logical reads without the covering index vs. 139 with it. Nice, isn't it? It is, but we can do better. A lot better. We just need to replace what *seems* right with something that *is* right – functions on columns in restrictive conditions of a query may correspond to the business requirements perfectly, but they usually fail to correspond to the principles of efficient data management implemented in the data store (as we'll see later on).
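By the way, the CPU and logical-read figures quoted here (and in the table further down) can be verified by turning on SQL Server's per-statement I/O and timing statistics before running the queries:

-- report logical reads and CPU time for every statement in this session
set statistics io on
set statistics time on
go

-- ... run query #1, #2, etc. and check the Messages tab
-- for "logical reads" and "CPU time" ...

-- turn the reporting off again when done
set statistics io off
set statistics time off
go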
What is "July 2001" in T-SQL?
Quite simply, it's the time between July 1st 2001 and July 31st 2001; both boundaries included:
-- #3
select SalesOrderID
 ,Status
 ,CustomerID
 ,SalesPersonID
 ,TotalDue
 from Sales.SalesOrderHeader
 where (OrderDate >= cast('2001-07-01' as datetime))
  and (OrderDate < dateadd(m, 1, cast('2001-07-01' as datetime)))
go
Because SQL Server always stores date and time as a single value (look up datetime and smalldatetime in Books Online), the values above are implicitly converted at run time to 2001-07-01 00:00:00.000 and 2001-08-01 00:00:00.000 respectively. There are plenty of ongoing discussions on whether SQL Server should support separate data types for date and time values, but personally, I see little use in that. Anyway, a month in T-SQL is a time range beginning with the first millisecond of the first day of the month and ending before the first millisecond of the first day of the next month.
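You can see both the implicit conversion and the resulting range boundaries for yourself:

-- the boundaries of "July 2001" after the implicit conversion to datetime
select cast('2001-07-01' as datetime) as LowerBoundary            -- 2001-07-01 00:00:00.000
 ,dateadd(m, 1, cast('2001-07-01' as datetime)) as UpperBoundary  -- 2001-08-01 00:00:00.000
go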
Back to our query: suddenly we're down to two logical reads. That's 2 logical reads to get 184 rows from a table containing 31465 rows.
The downside is that – purely for performance reasons – we'd have to propagate SQL-specific thinking up from the data layer: a month would no longer be just a simple value (or two) but a range of values (a set, if you like) – a concept that may just as well be completely irrelevant to the application (and/or the user). Could that be a problem?
The business requirement is simple: "show the data for a specific month", and as far as the solution as a whole is concerned the application logic behind what constitutes a month should not depend on the data layer's concept of a month – and vice versa. By translating the query conditions from the general definition corresponding to the business rule to a more specific definition corresponding to the data store rule we can prevent a conflict of the two concepts. IMHO this is one of the things stored procedures handle perfectly. Some elements of business logic should be implemented in stored procedures, but more importantly – data store-specific functionalities should be kept solely in stored procedures (outside of the application tier, by all means).
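As a rough sketch of what that might look like (the procedure name and signature below are mine, not part of AdventureWorks), a stored procedure can accept the business terms – a year and a month – and keep the translation into a date/time range entirely inside the data layer:

create procedure dbo.SalesOrderHeader_SelectByMonth
 (
 @year smallint
 ,@month tinyint
 )
as
begin
 set nocount on

 declare @lowerBoundary datetime
 declare @upperBoundary datetime

 -- translate the business term (a month) into a data store-specific range
 set @lowerBoundary = cast(cast(@year as char(4))
  + right('00' + cast(@month as varchar(2)), 2)
  + '01' as datetime)
 set @upperBoundary = dateadd(m, 1, @lowerBoundary)

 select SalesOrderID
  ,Status
  ,CustomerID
  ,SalesPersonID
  ,TotalDue
  from Sales.SalesOrderHeader
  where (OrderDate >= @lowerBoundary)
   and (OrderDate < @upperBoundary)
end
go

The application simply passes 2001 and 7; what a month means physically never leaves the procedure.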
Which brings us to the point of this post – a SQL-style date/time constructor could prove useful here.
As is quite obvious from the examples above, the translation of the single business term "July 2001" to a pair of data store-specific terms is what we need – a transition from the logical definition of a temporal quantity to the physical definition of a date/time range.
We'll use two scalar functions for this purpose; each consumes up to three parameters (the values representing the year, the month and the day), and one returns the lower boundary of the date/time range while the other returns the upper boundary.
create function dbo.Date_LowerBoundary
 (
 @year smallint = null
 ,@month tinyint = null
 ,@day tinyint = null
 )
returns datetime
/* The default value for @year is 1753 (datetime minimum). */
as
begin
 declare @minResult datetime

 -- build an unseparated 'yyyymmdd' string; missing or out-of-range parts
 -- fall back to the lowest valid value, and the string is implicitly
 -- converted to datetime on assignment
 select @minResult
  = case
     when @year between 1753 and 9999 then cast(@year as char(4))
     else '1753'
    end
  + case
     when @month between 1 and 12 then right('00' + cast(@month as varchar(2)), 2)
     else '01'
    end
  + case
     when (@month is not null) and (@day between 1 and 31)
      then right('00' + cast(@day as varchar(2)), 2)
     else '01'
    end

 return @minResult
end
go
create function dbo.Date_UpperBoundary
 (
 @year smallint = null
 ,@month tinyint = null
 ,@day tinyint = null
 )
returns datetime
/* The default value for @year is 9999 (datetime maximum). */
as
begin
 declare @minResult datetime
 declare @maxResult datetime

 set @year = coalesce(@year, 9999)

 -- the lower boundary of the range, built the same way as in Date_LowerBoundary
 select @minResult
  = case
     when @year between 1753 and 9999 then cast(@year as char(4))
     else '1753'
    end
  + case
     when @month between 1 and 12 then right('00' + cast(@month as varchar(2)), 2)
     else '01'
    end
  + case
     when (@month is not null) and (@day between 1 and 31)
      then right('00' + cast(@day as varchar(2)), 2)
     else '01'
    end

 -- the upper boundary is 3 ms before the start of the next range;
 -- for the year 9999 with no month given, step back first to avoid a datetime overflow
 select @maxResult
  = case
     when (@year = 9999) and (@month is null) then dateadd(yy, 1, dateadd(ms, -3, @minResult))
     when @month is null then dateadd(ms, -3, dateadd(yy, 1, @minResult))
     when @day is null then dateadd(ms, -3, dateadd(m, 1, @minResult))
     else dateadd(ms, -3, dateadd(dd, 1, @minResult))
    end

 return @maxResult
end
go
The functions provide a layer of abstraction between the business and the data store-specific concepts of time. The translation is made in the data access layer, and the application remains oblivious to the concept shift. The abstraction also allows for less determinism when supplying parameters: if all parameters are unknown (null), the lowest allowed datetime value is used for the lower boundary and the highest allowed datetime value for the upper boundary; alternatively, only the @year parameter can be specified, or @year and @month together, or all three parameters. The granularity therefore ranges from the entire SQL Server date/time range at its coarsest down to a single day at its finest.
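A few example calls illustrate the behaviour (the commented values are what the functions return for each combination of parameters):

-- all parameters unknown: the entire datetime range
select dbo.Date_LowerBoundary(null, null, null)   -- 1753-01-01 00:00:00.000
 ,dbo.Date_UpperBoundary(null, null, null)        -- 9999-12-31 23:59:59.997

-- year only: the whole of 2001
select dbo.Date_LowerBoundary(2001, null, null)   -- 2001-01-01 00:00:00.000
 ,dbo.Date_UpperBoundary(2001, null, null)        -- 2001-12-31 23:59:59.997

-- year and month: July 2001
select dbo.Date_LowerBoundary(2001, 7, null)      -- 2001-07-01 00:00:00.000
 ,dbo.Date_UpperBoundary(2001, 7, null)           -- 2001-07-31 23:59:59.997

-- year, month and day: a single day
select dbo.Date_LowerBoundary(2001, 7, 15)        -- 2001-07-15 00:00:00.000
 ,dbo.Date_UpperBoundary(2001, 7, 15)             -- 2001-07-15 23:59:59.997
go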
Warning: the parameters are not validated; the functions assume that the input parameters originate from valid date/time values. A standard system error is raised if the transformed values don't evaluate to datetime.
Actions speak louder than words
Here are two examples of how the two date/time functions could be used.
The functions could be used inline in the WHERE clause:
-- #4
declare @year smallint
declare @month tinyint
declare @day tinyint

set @year = 2001
set @month = 7

select SalesOrderID
 ,Status
 ,CustomerID
 ,SalesPersonID
 ,TotalDue
 from Sales.SalesOrderHeader
 where (OrderDate between dbo.Date_LowerBoundary(@year, @month, @day)
  and dbo.Date_UpperBoundary(@year, @month, @day))
go
Or the results of the functions could be assigned to variables prior to being used in the WHERE clause:
-- #5
declare @year smallint
declare @month tinyint
declare @day tinyint
declare @lowerBoundary datetime
declare @upperBoundary datetime

set @year = 2001
set @month = 7

set @lowerBoundary = dbo.Date_LowerBoundary(@year, @month, @day)
set @upperBoundary = dbo.Date_UpperBoundary(@year, @month, @day)

select SalesOrderID
 ,Status
 ,CustomerID
 ,SalesPersonID
 ,TotalDue
 from Sales.SalesOrderHeader
 where (OrderDate between @lowerBoundary and @upperBoundary)
go
The upper boundary function returns the last valid datetime value of a given range, not the first valid datetime value of the next range, which means it can be used with the BETWEEN operator. (And it won't break when the year 9999 finally comes. ;)
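For example, even at the very end of the supported range the function returns the last valid datetime value instead of running over the edge:

-- the last valid datetime value of the year 9999
select dbo.Date_UpperBoundary(9999, null, null)   -- 9999-12-31 23:59:59.997
go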
In case you're wondering, BETWEEN
is inclusive, which makes
@myDate between @lowerDate and @upperDate
and
@myDate >= @lowerDate and @myDate <= @upperDate
functionally equivalent.
Abstraction costs
Yes, if conceptual simplicity is the goal, one should count on a slight performance impact, as the following table clearly shows. However, another fact is (or should be) far more obvious: using functions on columns in the WHERE clause has a devastating impact on query performance, because the optimizer can't use (otherwise appropriate) indexes if the column values need to be transformed before the conditions can be evaluated.
| Query | No additional indexes * (CPU / logical reads) | Covering index ** (CPU / logical reads) | Datetime column index *** (CPU / logical reads) |
|---|---|---|---|
| #1 | 34 / 703 | 14 / 751 | 24 / 715 |
| #2 | 30 / 703 | 18 / 139 | 22 / 893 |
| #3 | 24 / 703 | 0 / 2 | 8 / 644 |
| #4 | 1730 / 707 | 4 / 7 | 1675 / 707 |
| #5 | 28 / 707 | 2 / 7 | 44 / 707 |
The most performant option remains the one used in query #3, where the date/time range is prepared in advance. If the 5 additional logical reads that option #5 costs compared to #3 seem too much in a given situation, then option #3 is the way to go. On the other hand, the functions discussed in this post provide an abstraction that is at least worth considering.
We've successfully reached the highest possible performance using a composite covering index with the datetime column at the principal position (statistics are kept for the first column of a composite index). Of course not all situations can be catered for with covering indexes – the possible impact on insert/update processes should always be considered. The two functions may still provide good results even without a covering index as long as they're not used in the WHERE clause directly; option #5 should be considered rather than option #4, which is apparently more dependent on appropriate indexing.
ML
* By default only key columns are indexed in the Sales.SalesOrderHeader table: rowguid, SalesOrderNumber, CustomerID, SalesPersonID, SalesOrderID.
** The covering index spans the OrderDate, SalesOrderID, CustomerID and SalesPersonID columns, and includes the Status and TotalDue columns.
*** To demonstrate the actual benefit of the covering index I've included the results of a test using a nonclustered index on the OrderDate column only:
create nonclustered index x_SalesOrderHeader_OrderDate
 on Sales.SalesOrderHeader
  (
  OrderDate
  )
3 comments:
very nice article.
also check this page out for all kinds of datetime goodies :)
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=64762
having a datetime table like that is also awesome, because then time periods are as simple as an inner join :)
thank you for showing the use of a covering index.
I find it surprising that an index with the OrderDate only does not get used. Is this the case on large datasets as well?
TIA
Henrik Staun Poulsen
While it is true that covering indexes provide the highest performance, you should also be aware that non-covering indexes still allow the query optimizer to build a more efficient execution plan.
A covering index leads to better performance simply because the database engine does not have to read data from the data pages of the data file - all the data is available on the index pages that have to be read anyway when testing the query condition.
On the other hand, a non-covering index that could be used to test the query conditions will be used if the data is selective enough, but the clustered index (e.g. primary key) would still have to be read as well as the data pages in order to retrieve all relevant values. However, the range of the clustered index scan (or seek) operation would be smaller than in the case where an appropriate (covering or non-covering) index would not be available.
Yes, both the size of the data and the selectivity of the indexed columns are highly critical when the execution plan is built. You can find a lot more information in Books Online.