SyntaxHighlighter

Wednesday, August 21, 2013

SQL Server and the Quest for the Intermediate Materialization Query Hint

The Adventure

I feel like I'm Indiana Jones and I'm hunting for a lost treasure: intermediate materialization. It seems like I do that a lot in SQL Server -- I'm probably on my 5th or 6th sequel by now (get it? SQL? SEQUEL? sequel?). Most of the time there is some kind of awful kluge to workaround the SQL Server team's stubborness. They wrote SQL Server that way, and it's just fine the way it is!

An example that's unrelated: paging results. In the old days, paging results was a nightmare that had to be done with the TOP keyword, sorting, and several subqueries. I shudder for those days. Then the window functions came and ROW_NUMBER() saved the day. Finally, in SQL Server 2012, Microsoft's dev team gave in and we got the OFFSET-FETCH keywords. (Funny, since MySQL and PostgreSQL have been making our lives easier with such keywords as LIMIT and OFFSET for ... decades?)

But one thing that doesn't seem to be on the table yet is (dun dun duuuun!) intermediate materialization. Sure, Microsoft has said they'll [try to think about maybe] support a query hint in the future for intermediate materialization, but that request was made in 2006 (!!!!!) and still hasn't arrived yet.

The Problem

Let's say you're diving into execution plans pretty deeply. You find there are some problems inherent to SQL Server's query optimizer -- it just doesn't do what you want it to do. 

You find that you can run some part of your query in a new connection window (that will later become a subquery). It runs in 3 seconds instead of 90 minutes. Voila! You have the answer. All you have to do now is to integrate the subquery into the main query. 

The problem starts after you stick your standalone query into a subquery in your view and you join it to the main query. Then you run it and you realize that it's now running at 90 minutes again! How frustrating, but why?

The reason, you discover after analyzing the query plan, is because the query optimizer has rewritten your subquery into a series of joins, undoing all the hard work you did by creating a subquery in the first place (and making your query run 10000% longer in the process).

You know there are such things as query hints, so you go off looking for a query hint to solve this problem. The problem is, you don't know what this problem is called. You search for subquery optimization in SQL Server and stop SQL Server from optimizing my subquery and SQL Server is rewriting subqueries and on and on until your fingers get numb. Then you discover that subqueries are sometimes called virtual queries so you search for SQL Server rewrites virtual queries. At some point you stumble upon some articles relating to materialized queries and you remember that you've heard those words before but you don't know what they mean because your database professor covered that topic on the Monday morning after the big football game. 

What is Materialization?

In databases, materialization is the process of gathering data from physical sources and creating results in the form of a relation (a table). It contains a snapshot of data necessary at that moment in time. Any query or view that produces results is referred to as materialized. Wikipedia has a good explanation of a materialized view.

It seems abstract now, but it's really a simple concept. It's like arrays or objects. Eventually you'll just get it. I suggest you try really hard to understand it now so that the learning will come faster because this is a really useful concept in your database life. It's like playing Super Mario Brothers. You just couldn't get past level 8-4 the night you tried for 6 hours, but the next morning after a good night's sleep, you got through it in one try.

What is Intermediate Materialization? 

Intermediate materialization is when query results must be stored temporarily before processing the remainder of the query. In other words, SQL Server has to stop joining pre-existing data sources, create a temporary table, and then join to that temporary table. 

Let me ask you a question? Have you ever created a temporary table from a SELECT statement and then joined that temporary table to another table? If your answer is yes, then you have just carried out the poor man's version of intermediate materialization

The problem is that SQL Server does not support intermediate materialization ... And in fact, if other bloggers are to believed, the SQL Server dev team is actively trying to make it harder for us to use intermediate results in a single query. [aside: I have a theory that the reason Microsoft doesn't want to add this query hint is because it would create a non-relational phase in the logical processing steps of executing queries.]

Solution

The old fashioned way to do this was to use TOP 100 PERCENT but evidently that went away in SQL Server 2005. As Adam Machanic suggests, using TOP ( 2147483647 ) but that was for SQL Server 2005, and who knows if it works in 2008 R2 or 2012?

Update 2013.08.28: Yes, it does seem to work in 2008 R2. Your mileage may vary. 

Another user on stackoverflow suggests using the CASE statement (and might I suggest that you should probably use it on your join keys) and I've had some luck with that in the past. Keep in mind, however, that if you're joining millions of rows that each one of them has to be cast and processed as an expression and that could slow your query down in its own right.

Example

I had a query that was taking 80 minutes. I tracked its duration over time and it was slowly increasing. I optimized the subquery so that it took 5 minutes but in the entire query the subquery was being optimized as a series of joins. After tricking SQL Server into intermediate materialization, I was able to produce the query results consistently in about 6 minutes. This chart demonstrates the massive savings I was able to get by using intermediate materialization. Your mileage may vary!

No comments:

Post a Comment