SyntaxHighlighter

Tuesday, December 3, 2013

Transform a Date Range into a Time Series

It's been a few weeks since I last posted anything because I've relocated to a new company. The transition was smooth but I haven't had time to post. Now, however, after having dived right in, I'm back up and running and writing big queries for big reports.

Yesterday I received a data request asking for a time series dataset, but our management system only stores data in ranges. For any given range of time X, they wanted to see the data for that time period broken into a time series -- in this case, by month. Conceptually, it's similar to an UNPIVOT statement in that it's taking two columns (a start and an end date) and turning it into X number of months as rows.

Here's an example of what the range looks like (table A):

BEGINDATE ENDDATE DATA
2013-01-01 2013-04-04 8

And here's what the output should look like (table Z):

FIRSTDAYOFMONTH DATA
2013-01-01 8
2013-02-01 8
2013-03-01 8
2013-04-01 8

Brilliant. So the result looks somewhat like a table that has been unpivoted, but after careful analysis, it's not the same transformation. What really has to happen is that, given the number of months in the date range, that's how many rows need to be created.

There are a number of facilities for performing this action in T-SQL, given that T-SQL has many iterative features (cursors, while loops, et cetera) so this problem could easily be solved that way. However, it can also be solved in a single query given that SQL Server includes a number of set-based features that can perform this task. You have to understand the problem and use the right tool for the job.

What is the right tool for the job? Well, after more careful analysis, we realize that the core of this problem is creating a variable number of rows in output. Are there any tools in SQL server that will do such a thing? Yes! JOIN keywords are perfectly capable of turning one row into many in a one-to-many relationship. On the left hand side of the join we need a "one" table and in the right hand side of the join we need a "many" table. We have the "one" table -- table A. But we need the "many" table -- table B.

There are several ways to create table B, but the way I prefer is to simply use the ROW_NUMBER window function. Just somehow manage to get a bunch of rows, then number them. Here's a fairly elegant way to do this:

 
 SELECT TOP(1000)
  ROW_NUMBER() OVER( ORDER BY ( SELECT 0 ) ) - 1 AS NUMBER
 FROM
  sys.columns a
 CROSS JOIN 
  sys.columns b
 CROSS JOIN 
  sys.columns c
 CROSS JOIN 
  sys.columns d

It works by creating a huge number of rows and then numbering them. In other databases, such as PostgreSQL, you could theoretically use a function like generate_series(). Or you could create a table valued function in SQL Server that does the same thing. It makes no difference. What does make a difference, however, is the TOP(1000) qualifier. This should be the maximum number of time units you ever expect to see. 1000 months is about 83 years, but 1000 seconds is about 17 minutes.

So now we have table B. All you have to do now is join them together. But wait, table B  returns a bunch of numbers, not months!

The trick, then, is to join on numbers, not months. Does SQL Server have a function for converting months to numbers? In fact it does. DATEDIFF will handle this quite easily.

 SELECT 
  DATEDIFF( MONTH, '2013-01-01', '2013-04-01' )

The result, as you would expect, is 3. There are three months between January and April (and there would be 0 months between January and January of the same year).

So now, you can join on numbers (pseudo-SQL below):

 
 ...

 FROM
  DATA D
 JOIN
  NUMBERS N
 ON
  N.NUMBER = DATEDIFF( MONTH, '2013-01-01', '2013-04-01' )

The problem here, of course, is that only the unique values join together and you end up with one row, so what you ultimately want is a range join, meaning the join can occur on a range of values, in this case, any number between 0 and DATEDIFF(...). Since 0 is implied because ROW_NUMBER starts at 1 (and we subtracted 1 leaving 0 as the starting point).

 
 ...

 FROM
  DATA D
 JOIN
  NUMBERS N
 ON
  N.NUMBER < DATEDIFF( MONTH, '2013-01-01', '2013-04-01' )

So now you understand the concept if I've explained myself correctly. You can composite the joins however you want. I prefer to use a CTE to make the code look cleaner, but you could just as easily use any form of derived query (CTE, subquery, table valued function). Here's my final product:

-- create a CTE (common table expression - a "derived query") for the numbers table
-- NOTE: the numbers table will never change unless you need more than 1000 units of time 
-- (i.e., 1000 months) 
WITH 
 NUMBERS
AS
(
 SELECT TOP(1000)
  ROW_NUMBER() OVER( ORDER BY ( SELECT 0 ) ) - 1 AS number
 FROM
  sys.columns a
 CROSS JOIN 
  sys.columns b
 CROSS JOIN 
  sys.columns c
 CROSS JOIN 
  sys.columns d 
),

-- create a CTE (common table expression - a "derived query") for the data
 DATA
AS
(
 SELECT
  '2013-01-01' AS BEGINDATE,
  '2013-04-01' AS ENDDATE,
  8 AS DATA          
)

-- work with the derived queries as if they were tables (whose aliases are NUMBERS and DATA)
SELECT

 -- Get the first day of the month with a convoluted calculation -- this would be much easier
 -- in SQL Server 2012 because there is an EOMONTH function
 CAST( DATEADD( MONTH, N.NUMBER, D.BEGINDATE ) AS DATE ) AS FIRSTDAYOFMONTH,
 D.DATA
FROM
 DATA D
JOIN   
 NUMBERS N

-- this is where the magic happens ... join on the numbers table which is simply a sequence
-- of numbers from 0-999 with the difference between the dates in terms of months.
-- So, the difference between January and March is (3-1) = 2, so the join will occur 
-- between 0 and 2, that is 0, 1, and 2 -- thereby creating 3 rows, one for each month
ON 
 N.NUMBER <= DATEDIFF( MONTH, D.BEGINDATE, D.ENDDATE )

-- this is just limiting the query so that you won't see months after the current
-- month; remove this for projections into the future
WHERE
 D.ENDDATE < GETDATE()
ORDER  BY
 FIRSTDAYOFMONTH

Update: Some ideas for this post came from SQL how to convert row with date range to many rows with each date.

No comments:

Post a Comment